Abstract


Parallel  programming  requires  task  scheduling  to  optimize  performance;  this  primarily  involves 
balancing  the  load  over  the  processors.  In  many  cases,  it  is  critical  to  perform  task  scheduling 
at  runtime.  For  example,  (1)  in  many  parallel  applications  the  task  load  cannot  be  accurately 
predicted a priori; (2) in a network-based multicomputer the computational power of each processor
may not remain constant. In order to support dynamic task scheduling, the programmer
usually needs to design and implement a complex set of scheduling routines, e.g., routines
for maintaining task lists and handling interprocessor communication for load balancing.
Unfortunately, it is very difficult and time-consuming to write and debug all of these scheduling
routines. 

This  thesis  proposes  a  new  approach  which  can  greatly  reduce  the  effort  of  developing 
efficient  dynamic  task  scheduling  routines.  In  our  new  approach,  we  decompose  task  scheduling 
into  two  parts  —  the  specification  of  scheduling  policies  and  the  implementation  of  supportive 
scheduling  operations  —  and  then  hide  the  latter  from  the  programmer.  We  call  this  approach 
multilist  scheduling,  because  it  is  based  on  a  uniform  scheduling  model  involving  the  use  of 
multiple  scheduling  lists. 

This  thesis  analyzes  three  main  features  of  the  new  multilist  scheduling  model:  ease  of  use, 
generality,  and  efficiency. 

• Ease of use: Programmers only need to specify scheduling policies, not the details of
supportive  scheduling  routines.  Typically,  this  involves  writing  only  tens  of  lines  of  C 
code,  as  opposed  to  thousands  of  lines  of  code  for  the  supportive  scheduling  routines. 

•  Generality:  We  show  that  this  model  results  in  no  loss  of  generality.  We  also  illustrate  the 
generality  of  the  model  by  rendering  several  scheduling  algorithms  in  the  framework  of 
our  model,  including  the  scheduling  algorithms  for  parallel  divide-and-conquer  (D&C) 
and  best-first  search  (BFS). 

•  Efficiency:  We  propose  some  efficient  techniques  for  implementing  scheduling  lists,  and 
also  show  that  our  general  approach  incurs  no  significant  performance  overhead,  at  least 
for the parallel D&C and BFS scheduling algorithms. In addition, we also demonstrate
good  performance  results  for  some  applications  that  are  based  on  parallel  D&C  and  BFS, 
such  as  the  set  covering  problem. 


Traditionally, it has been difficult to efficiently support both parallel D&C and BFS in a
uniform framework. We believe that our system is the first system to do so.

Multilist  scheduling  is  the  first  model  which  can  hide  the  details  of  dynamic  task  scheduling 
routines  while  supporting  general  task  scheduling.  We  expect  that  this  model  will  have  a 
significant  impact  on  parallel  programming,  especially  in  the  domain  of  multicomputers. 
Chapter 1


Introduction 


Parallel  programming  requires  task  scheduling  to  optimize  performance;  this  primarily  involves 
balancing  the  load  over  the  processors.  Task  scheduling  can  be  classified  into  static  task 
scheduling or dynamic task scheduling, according to the time when task scheduling is performed.
Static  task  scheduling  is  typically  done  by  distributing  the  load  of  tasks  evenly  (or  roughly 
evenly)  over  processors  at  the  beginning  of  the  job.  This  can  be  done  with  the  help  of  compilers 
[39,  80,  100].  However,  in  many  cases,  it  is  critical  to  perform  task  scheduling  at  runtime: 


•  In  many  parallel  applications,  the  task  load  cannot  be  accurately  predicted  a  priori.  For 
example,  in  mathematical  optimization  problems  [35,  76],  which  usually  involve  large- 
scale  tree  searching,  it  is  impossible  to  make  useful  a  priori  estimates  on  the  size  of  the 
search  tree  or  the  sizes  of  its  nodes. 

• In network-based multicomputers¹, the computational power of each processor may
not remain constant. For example, in a network-based multicomputer, someone may
unexpectedly come to work on one of the participating workstations, slowing it down.
Static task scheduling (or static load balancing) cannot balance the load well in such a
situation.

¹A network-based multicomputer is a number of computers (e.g., workstations), connected via a network,
cooperating on the same job. As networks have grown more efficient, network-based multicomputers [61] have
recently emerged as a new and attractive type of parallel system, because resource sharing gives them flexibility
and low cost. We expect more and more applications to run on such multicomputers.

We will first describe the traditional approach to dynamic task scheduling (or dynamic load
balancing)  and  point  out  the  problem  of  the  approach  in  Section  1.1.  Then,  we  will  briefly 
describe  our  new  approach  in  Section  1.2.  Finally,  we  will  give  an  overview  of  this  thesis  in 
Section  1.3. 


1.1  Traditional  Approach 


The  traditional  approach  to  dynamic  scheduling  is  ad  hoc:  for  each  application  or  each  class 
of applications, a task scheduling algorithm is implemented from scratch. For example:


•  For  a  large  class  of  applications  which  can  be  solved  without  concern  for  the  scheduling 
sequence  of  executable  tasks,  many  researchers  [31,  67,  89,  107]  have  proposed  various 
load  balancing  techniques  to  efficiently  parallelize  the  applications. 

• For tree search problems (e.g., divide-and-conquer (D&C) problems) which can be
efficiently solved by using depth-first search (DFS), many researchers [32, 33, 34, 44,
82, 104, 105, 108] have proposed scheduling algorithms which can reduce the amount of
communication while balancing the load.

• For tree search problems which can be efficiently solved by using best-first search (BFS),
several researchers [1, 5, 46, 60, 76, 81, 101, 108] have proposed ways to parallelize
BFS.

• For game tree search problems which can be solved by using alpha-beta (α-β) search,
researchers have proposed many different kinds of parallel scheduling algorithms, e.g.,
the mandatory work first algorithm [4], the principal variation splitting (PVS) algorithm
[21, 69], and other variations [11, 45, 87].

• For scientific applications with parallel loops, researchers have proposed efficient runtime
scheduling algorithms using the following techniques: factoring scheduling [48], guided
self-scheduling [78], and phase-based scheduling [70].


To facilitate comparison with our new approach, we present the traditional approach in
terms of programming layers, as illustrated in Figure 1.1. We separate application programs and
scheduling programs into two different programming layers, the application layer (high level)
and the scheduling layer (low level), respectively. The application layer is for applications;
the scheduling layer is for parallel scheduling algorithms, with each algorithm supporting
one application or a class of applications. Programmers in the application layer are called
application programmers (they are not expected to design parallel scheduling algorithms);
programmers in the scheduling layer are called scheduling programmers. In addition to the
above two layers, the network interface layer provides a general mechanism for network
communication and supports an interface for the scheduling layer to access the network system.
For example, PVM [37], Express [49], iPSC primitives [50], Nectarine [95], and socket packages
for TCP/IP [65] would reside in the network interface layer. Programmers in this layer are called
system designers. Note that in the rest of this thesis when we refer to a generic “programmer”,
we will mean the scheduling programmer.

Figure 1.1: Three parallel programming layers in the traditional approach. (Layers, top to
bottom: the application layer, with BFS, D&C, α-β, and parallel loops as example classes; the
scheduling layer; and the network interface layer, e.g., PVM.)

The  traditional  approach  has  a  serious  problem:  it  requires  a  large  effort  to  implement  any  of 
these  dynamic  scheduling  algorithms.  In  order  to  implement  a  dynamic  scheduling  algorithm, 
the programmer usually needs to write the details of supportive scheduling routines, e.g.,
those for maintaining task lists and handling interprocessor communication for load balancing.
Unfortunately, it is very difficult and time-consuming to write and to debug these supportive
routines. (Note that concurrent debugging, in particular, is clearly much more difficult than
sequential debugging, due to nondeterminism.) For example, when we were parallelizing a
solid modeling program (called Noodles [23]) based on a simple load balancing strategy [31],
it  took  us  months  to  write  thousands  of  lines  of  load-balancing  code  in  C.  From  this,  we 
understand  that  it  is  extremely  important  to  provide  a  general  scheduling  system  which  can 
hide  these  supportive  scheduling  operations  from  the  programmer  such  that  programmers  only 
need to focus on the specification of the scheduling policies.


1.2  Our  New  Approach 


Recently,  researchers  have  begun  to  notice  that  these  supportive  scheduling  operations  are  very 
similar  among  many  scheduling  algorithms.  In  response,  some  systems  have  begun  to  provide 
greater  generality  with  high-level  interfaces.  For  tree  search  problems,  Wu  [103]  proposed 
a  parallel  programming  system,  called  dual-priority  task  scheduling  (an  early  version  of  our 
present  system),  which  supports  flexible  scheduling  algorithms  for  most  tree  search  problems. 
In  addition,  Nishikawa  and  Steenkiste  [70,  71]  proposed  the  Aroma  language  (mainly  based 
on  phase-scheduling)  which  offers  a  range  of  scheduling  algorithms  to  cover  many  scientific 
applications.  Still,  none  of  the  above  have  been  claimed  to  be  general. 

Figure 1.2: Four parallel programming layers in our new approach.


In  this  thesis,  we  will  propose  a  general  approach  which  can  greatly  reduce  the  effort  of 
developing  dynamic  scheduling  routines.  In  our  new  approach,  we  decompose  task  scheduling 
into  two  parts  —  the  specification  of  scheduling  policies  and  the  implementation  of  supportive 
scheduling  operations  —  and  then  hide  the  latter  from  the  programmer.  We  call  this  approach 
multilist scheduling², because it is based on a uniform scheduling model involving the use of
multiple scheduling lists. In the multilist scheduling model, programmers only need to specify
scheduling policies based on scheduling lists. The supportive routines for the implementation of
scheduling lists are moved into the system and hidden from the programmer. These supportive
routines from the scheduling layer form a new layer called the scheduling support layer, as
shown in Figure 1.2. In the past, the scheduling programmers had to write the supportive
routines; now, this will be the responsibility of the system designer.

²In the area of compiler design, list scheduling [2, 59] is a common technique in which a heuristic function is
used to indicate the execution sequence of tasks (or instructions) in a scheduling list. Our system is similar to list
scheduling in this sense. However, the list scheduling technique used by compilers usually sorts the scheduling
list before scheduling tasks from it, while our multilist scheduling system inserts tasks into or deletes tasks from
scheduling lists at runtime.


1.3  Overview  of  This  Thesis 


Chapter  2  will  describe  our  multilist  scheduling  model  in  greater  detail  and  will  also  show  the 
generality  of  the  model.  Chapter  3  will  illustrate  the  generality  and  simplicity  of  the  model 
by  implementing  several  interesting  scheduling  algorithms,  such  as  parallel  D&C  and  BFS, 
based  on  the  model.  Chapter  4  will  describe  the  current  implementation  and  will  show  that 
our  general  approach  incurs  no  significant  performance  overhead  at  least  in  these  two  cases. 
Chapter  5  will  present  some  theoretical  results  which  we  have  obtained  while  developing  our 
model.  Since  these  theoretical  results  have  no  strong  relation  to  the  rest  of  this  thesis,  the  reader 
may  skip  this  chapter  without  loss  of  continuity.  Chapter  6  will  demonstrate  good  performance 
results  for  some  applications  that  are  based  on  the  parallel  D&C  and  the  parallel  BFS  scheduling 
algorithms,  such  as  the  set  covering  problem.  Chapter  7  will  give  our  conclusions.  Appendix 
A  will  define  the  user  interface  for  our  multilist  scheduling  model. 
Chapter 2


Multilist  Scheduling 


In  Section  2.1,  we  will  define  a  computational  model  for  task  scheduling.  In  Section  2.2,  we 
will  give  a  general  introduction  to  the  multilist  scheduling  approach.  The  model  based  on  this 
new  approach  will  be  formally  defined  in  Section  2.3.  Finally,  we  will  discuss  this  model  in 
Section  2.4. 


2.1  Computational  Model  for  Task  Scheduling 


Figure 2.1: Tasks and messages.

The computational model for task scheduling has two basic elements: tasks and messages.
Tasks are the basic program units that run concurrently; a message is a piece of data sent from
one task to another task. Figure 2.1 illustrates an example. The computational model obeys
the following rules.


1. The system creates a task after all the expected messages describing the task have been
received.  These  messages  will  be  subsequently  consumed. 

2.  After  all  the  expected  messages  have  been  received,  a  task  ignores  further  messages. 

3.  A  processor  completes  a  task  before  switching  to  another  task. 

4.  A  task  can  generate  zero  or  more  messages.  Note  that  for  simplicity  creating  messages 
only  happens  at  the  end  of  the  task’s  computation. 

5.  A  task  frees  itself  immediately  before  it  is  terminated. 


The third rule implies that the entire computation of a task is executed sequentially on
a processor, and that tasks cannot preempt each other. Therefore, if parallel computation is
desired within a task, then the task should explicitly be decomposed into parallel subtasks. It
also follows that this model cannot handle applications that depend on preemption, such as
certain real-time applications.
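
The five rules can be made concrete with a small sequential sketch in Python. It is only an illustration under stated assumptions: the class `Runtime`, its methods `declare`, `send`, and `run`, and the idea of identifying a task by a string are hypothetical and are not part of the interface defined in this thesis.

```python
from collections import deque


class Runtime:
    """Toy single-processor rendering of the five rules (hypothetical API)."""

    def __init__(self):
        self.expected = {}     # task id -> number of messages the task waits for
        self.bodies = {}       # task id -> function(payloads) -> [(dest, payload), ...]
        self.inbox = {}        # task id -> payloads received so far
        self.created = set()   # tasks that already received all expected messages
        self.ready = deque()   # created tasks waiting for the processor
        self.log = []          # order in which tasks ran

    def declare(self, task_id, n_expected, body):
        self.expected[task_id] = n_expected
        self.bodies[task_id] = body

    def send(self, task_id, payload):
        if task_id in self.created:
            return  # rule 2: further messages are ignored
        received = self.inbox.setdefault(task_id, [])
        received.append(payload)
        if len(received) == self.expected[task_id]:
            # rule 1: all expected messages arrived, so the task is created
            # and its messages are consumed
            self.created.add(task_id)
            self.ready.append((task_id, self.inbox.pop(task_id)))

    def run(self):
        while self.ready:
            task_id, payloads = self.ready.popleft()
            # rule 3: the task runs to completion without preemption;
            # rule 5: its body is dropped from the table as it terminates
            outgoing = self.bodies.pop(task_id)(payloads)
            self.log.append(task_id)
            # rule 4: messages are released only after the computation ends
            for dest, out_payload in outgoing:
                self.send(dest, out_payload)


results = []


def t3_body(payloads):
    results.append(sum(payloads))
    return []


rt = Runtime()
rt.declare("T1", 1, lambda payloads: [("T3", payloads[0] + 1)])
rt.declare("T2", 1, lambda payloads: [("T3", payloads[0] * 2)])
rt.declare("T3", 2, t3_body)
rt.send("T1", 10)
rt.send("T2", 10)
rt.send("T1", 99)  # ignored: T1 already has its one expected message
rt.run()
```

In this sketch T3 is created only after both T1 and T2 have finished and emitted their messages, which mirrors the way a task in the model depends on the complete set of messages that describe it.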

Consider an example of parallel programming which requires the following primitives: (1)
fork a thread, (2) send data to a thread, and (3) receive data in a thread. We will show that these
primitives can be represented in the above computational model as follows.


Fork a thread. Create a message which subsequently creates a task corresponding to the
thread. For example, in Figure 2.1, task T_3 representing some thread can fork another
thread by creating a message, say M_4; then, message M_4 will in turn create task T_5,
corresponding to the new thread.

Send data to a thread. Create a corresponding message (the delivery and the destination are
part of the message). For example, in Figure 2.1, task T_2 sends a message M_2 to a thread
represented by task T_3.

Receive data in a thread. Create a message M (containing the content of the whole thread)
for continuation of the thread; terminate the current task (corresponding to the current
thread); when a send message corresponding to the receive exists, create a new task from
the send message and M (i.e., just resume the execution of the original thread). For
example, in Figure 2.1, when task T_1 representing some thread wants to receive message
M_2, the task will create message M_1 for continuation of the thread. Then, when message
M_2 is created, both M_1 and M_2 will create a new task T_3 which resumes the execution
of the original thread.


In  the  rest  of  this  thesis,  we  will  assume  that  a  thread  can  receive  messages  at  any  time 
during  its  execution,  but  that  a  task  can  receive  messages  only  at  the  beginning  of  its  execution. 
So,  each  segment  of  a  thread  between  the  receipt  of  messages  corresponds  to  a  complete  task. 
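
The correspondence between thread segments and tasks can be illustrated by analogy with a Python generator. This is a sketch under assumptions, not the mechanism used by our system: the functions `worker_thread` and `run_as_tasks` are hypothetical, each `yield` stands for a receive, and the suspended generator plays the role of the continuation message M.

```python
def worker_thread():
    """A thread that receives twice; each `yield` marks a receive point."""
    total = 0
    first = yield "waiting for first message"    # segment 1 ends here
    total += first
    second = yield "waiting for second message"  # segment 2 ends here
    total += second
    return total                                 # segment 3 ends the thread


def run_as_tasks(thread, incoming):
    """Drive a generator-based thread as a chain of run-to-completion tasks."""
    continuation = thread()
    next(continuation)       # first task: runs up to the first receive
    segments = 1
    result = None
    for payload in incoming:
        try:
            # a new task is created from the continuation plus the send message
            continuation.send(payload)
        except StopIteration as stop:
            result = stop.value  # the thread finished in this segment
        segments += 1
    return segments, result


segments, result = run_as_tasks(worker_thread, [3, 4])
```

A thread with two receives is thus executed as three tasks, each of which obtains all of its input before it starts running.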

In  order  to  implement  task  scheduling,  the  programmer  needs  to  write  a  procedure,  called 
the  task  scheduler.  Whenever  a  processor  is  idle,  the  system  applies  the  task  scheduler  in  order 
to  schedule  a  new  task,  based  on  the  state  of  the  whole  system  (tasks,  messages,  and  processors). 
If  more  than  one  processor  tries  to  schedule  tasks  simultaneously,  only  one  processor  can  apply 
the  scheduling  operation  at  a  time.  In  other  words,  this  is  an  atomic  operation. 


Figure  2.2:  Computational  model  for  task  scheduling. 

This model for task scheduling will be called the standard scheduling model. It is illustrated
in Figure 2.2 with p processors, denoted by P_1, ..., P_p. In accordance with the definitions of
scheduling programmer and application programmer in Section 1.1, the scheduling programmer
is responsible for the implementation of the task scheduler, and the application programmer is
responsible for the implementation of tasks and messages. In the rest of this chapter, we will
develop the model for the design of the task scheduler.


2.2  General  Approach 


On  a  single  processor,  a  task  scheduling  sequence  is  basically  a  list  of  tasks  ordered  according 
to  their  priorities,  the  preferences  of  scheduling  these  tasks.  We  call  such  a  list  a  scheduling 
list.  In  a  parallel  system,  extending  the  above  paradigm,  task  scheduling  requires  the  use  of 
multiple  scheduling  lists  and  multiple  priority  assignments  per  task.  We  will  illustrate  this 
necessity  with  the  following  example. 


Synchronous  Network  Simulation 

Network  simulation  is  a  common  computational  paradigm,  in  which  data  dependencies  can 
be  described  by  a  network  or  graph.  Examples  of  network  simulation  computations  include 
finite  element  simulation  such  as  fluid  simulation  [53];  differential  equation  solving  such  as 
weather prediction [24, 25]; digital circuit simulation such as gate-level simulation [20]; and
digital  signal  processing  such  as  sonar  detection  [80]. 

A  network  simulation  computation  can  be  represented  by  a  directed  graph,  in  which  a  node 
represents  a  thread  and  an  edge  represents  the  data  dependency  between  threads.  If  there  is  an 
edge  from  node  A  to  node  B,  node  A  needs  to  send  data  to  node  B  at  the  end  of  each  phase.  We 
will  call  this  edge  an  out-edge  for  A  and  an  in-edge  for  B. 

A  synchronous  network  simulation  (SNS)  computation  is  a  network  simulation  computation 
in  which  all  nodes  need  to  be  synchronized  in  each  phase.  (Note  that  an  asynchronous 
network  simulation  computation  is  a  network  simulation  computation  without  the  constraint 
of  synchronization;  we  will  discuss  this  case  in  Section  3.2.4.)  For  SNS,  each  node  will  be 
executed  once  during  each  phase.  Then,  each  node  will  send  data  to  neighboring  nodes  via  the 
out-edges  in  the  graph.  The  SNS  computation  can  advance  to  the  next  stage  when  all  the  nodes 
have  received  data  from  their  neighboring  nodes  via  in-edges.  (From  the  task  definition  in  the 
previous  section,  each  node  during  one  phase  can  be  considered  to  be  a  task.) 

The traditional approach for SNS is to partition the SNS graph over processors by hand [62,
Section  4]  or  by  compilers  [80].  However,  if  the  grain  size  of  each  node  (in  a  phase)  cannot 
be  known  a  priori,  it  becomes  critical  to  partition  the  graph  over  processors  at  runtime.  Here, 
we  propose  a  dynamic  load  balancing  strategy  based  on  the  criterion  of  keeping  the  number  of 
cross  edges  as  small  as  possible.  A  cross  edge  is  an  edge  between  two  nodes,  which  reside  on 
different  processors. 
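
The cross-edge criterion can be expressed in a few lines of Python. The sketch below is illustrative only: the helper names `count_cross_edges` and `cost_of_move`, the example graph, and the node-to-processor assignment are all assumptions introduced here.

```python
def count_cross_edges(edges, owner):
    """Number of edges whose two end nodes reside on different processors."""
    return sum(1 for a, b in edges if owner[a] != owner[b])


def cost_of_move(edges, owner, node, dest):
    """Change in the number of cross edges if `node` migrates to processor `dest`."""
    trial = dict(owner)
    trial[node] = dest
    return count_cross_edges(edges, trial) - count_cross_edges(edges, owner)


# Hypothetical four-node graph partitioned over two processors (0 and 1).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
owner = {"A": 0, "B": 0, "C": 1, "D": 1}

before = count_cross_edges(edges, owner)
move_a = cost_of_move(edges, owner, "A", 1)
move_b = cost_of_move(edges, owner, "B", 1)
```

A negative cost means that the migration reduces the number of cross edges, so a processor would prefer to take that node over one with a zero or positive cost.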


Figure 2.3: Partitioning an SNS graph over four processors. (Edges in the figure denote data
dependencies.)

Let us consider an SNS computation, illustrated in Figure 2.3, at some moment of a phase.
Its SNS graph is partitioned over four processors, P_1, P_2, P_3, and P_4 (the partitioning is indicated
by the thick lines in this figure and nodes are represented by circles). For the node marked with
“x”, each processor has a different preference to schedule this node: Processor P_1 has a very
high preference to schedule the node because it is always good to schedule local nodes before
requesting nodes from other processors. Processor P_4 has a very low preference to schedule
the node because moving the task to P_4 will incur the expense of many more cross edges.
Similarly, processors P_2 and P_3 have in-between preferences for scheduling the node. One can
also apply the same strategy to other nodes to find the relative priority of scheduling any task
on any processor. This shows that each processor can have its own perspective of the preferred
task scheduling sequence. This suggests that a general scheduling model should at least allow
each processor to express its own perspective of the task scheduling sequence.

p-List  Model 
Figure 2.4: p-list model.

In a parallel system with p processors (denoted by P_1, ..., P_p), since each processor may have
its own perspective of the task scheduling sequence, we assign, in our approach, a scheduling
list to each processor. Since there are p scheduling lists in all, we call the model the p-list
model. In this model, the programmer only needs to assign to a task p priorities, (π_1, ...,
π_p), with each π_i representing the priority of the task in the scheduling list of P_i. Figure 2.4
illustrates the p-list model with five tasks denoted by T_1, T_2, T_3, T_4, and T_5. In this figure, the
small-font number on the right-hand side of each task in each list represents the priority¹ of
the task in the list. For example, for task T_1, π_1 = 2, π_2 = 7, and π_p = 7. When a task
is created, it is simultaneously inserted into each P_i's list according to the task's priority π_i.
(Note that if a task must never be scheduled by some processor, say P_i, then we can set the
priority π_i of the task to an undefined value and not insert the task into the corresponding
scheduling list.) If some processor schedules a task (with the highest priority) from the head
of its scheduling list for execution, say processor P_2 schedules task T_1, all the instances of T_1
in other scheduling lists will also be removed. If the priorities of a task are changed at runtime,
the location of the task in each list must be changed accordingly.
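
The insert and schedule operations of the p-list model can be sketched as follows. This is a centralized toy version under assumptions: the class `PListScheduler` is hypothetical, processors are numbered from 0 instead of 1, each list is kept as an unsorted dictionary for brevity, and `None` stands for an undefined priority.

```python
class PListScheduler:
    """Centralized toy version of the p-list model (hypothetical class)."""

    def __init__(self, p):
        # lists[i] maps task -> priority in the scheduling list of processor i
        self.lists = [dict() for _ in range(p)]

    def insert(self, task, priorities):
        # priorities[i] is None when processor i must never schedule the task
        for i, priority in enumerate(priorities):
            if priority is not None:
                self.lists[i][task] = priority

    def schedule(self, i):
        own = self.lists[i]
        if not own:
            return None
        task = max(own, key=own.get)   # head of the list: largest priority
        for other in self.lists:       # remove every instance of the task
            other.pop(task, None)
        return task


sched = PListScheduler(3)
sched.insert("T1", (2, 7, 7))
sched.insert("T2", (5, 1, None))
sched.insert("T3", (3, 3, 9))
order = [sched.schedule(1), sched.schedule(0), sched.schedule(2), sched.schedule(2)]
```

Calling `insert` again for a task that is already present overwrites its priorities, which corresponds to changing the priorities of a task at runtime.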

A  very  important  result  for  the  p-list  model  is  that  the  use  of  p  scheduling  lists  is  sufficient  for 
specifying  all  parallel  scheduling  policies,  i.e.,  the  p-list  model  results  in  no  loss  of  generality 
with respect to the standard scheduling model (as described in Section 2.1).


Assertion  The  p-list  model  results  in  no  loss  of  generality  with  respect  to  the 
standard  scheduling  model. 

¹Note that some systems use a small number to represent a high priority but some other systems use a large
number. For consistency, we will always use a large number to represent a high priority in this thesis.

Justification. This follows from the fact that we can recast any scheduling algorithm
A in terms of our p-list model, as follows. Consider how scheduling is
performed when a processor tries to schedule a task. If we use algorithm A (explicitly
coded), the schedule is determined by executing the code; while if we use the
p-list model, the schedule is determined by relative task priorities, which must be
settled before task scheduling time. Since priorities may be changed dynamically
in the p-list model, we only need to ensure that the highest priority in each list is
always assigned to the “correct” task. To be more precise, assume that, for each i, task
T_i is the one that algorithm A would schedule on processor P_i if P_i were the
next processor to schedule a task. In the p-list model, the programmer can assign
priorities to tasks in such a way that task T_i has the highest priority in P_i's list
(more precisely, the ith priority of T_i is the highest among the ith priorities of all
the tasks). Thus, whichever processor requests a task next, the p-list system will
schedule the same task as algorithm A. Thus the priority assignment scheme in
the p-list model realizes algorithm A.

The  above  assertion  shows  that  the  p-list  model  is  well  suited  to  the  specification  of 
scheduling  policies.  However,  we  do  not  want  to  naively  implement  each  scheduling  list  on  its 
corresponding processor, because such a straightforward implementation would be inefficient
on a distributed-memory system, due to the following two communication problems: First,
whenever a processor inserts or deletes a task, the processor would have to inform every other
processor. This will result in a large amount of communication. Second, processors may
compete  with  each  other  and  attempt  to  schedule  the  same  task  simultaneously. 


p²-List Model

Figure 2.5: p²-list model.

In order to solve these problems of excessive communication and scheduling competition
(processor race conditions), we want to avoid inserting/deleting a task across processors while
keeping the notion of using p scheduling lists. We will modify the p-list model as illustrated
in Figure 2.5. Let the original scheduling list of each processor P_i become a virtual list (VL),
denoted by VL_i, and let each P_i store p small physical lists (PLs), denoted by PL_{i,1}, PL_{i,2},
..., PL_{i,p}. Each individual PL_{i,j} is physically maintained by processor P_i, while each VL_i is
conceptually constructed from tasks appearing in PL_{1,i}, PL_{2,i}, ..., PL_{p,i}. Since there are p²
PLs in the entire system, we call the model the p²-list model. In a VL, the tasks are merged on
the basis of the priorities assigned to them on the corresponding PLs. When creating a task,
the programmer assigns p priorities to the task and designates a processor, say P_i, where the
task will be stored. (Note that in many cases the designated processor is set to the one creating
the task so that the insertion operation requires no communication.) Subsequently, the system
will insert the task into each PL_{i,j} according to the jth priority of the task. Whenever some
processor wants to schedule this task, say processor P_2 schedules T_1 in Figure 2.5, we can
delete task T_1 from each PL_{i,j}. Although the insertion and deletion operations only update the
PLs on P_i, implicitly each VL is also changed accordingly.

There are two main advantages to the p²-list model. First, if several processors try to
simultaneously schedule a task which is stored in the PLs on some processor P_i, we can
resolve this by letting P_i determine which processor can schedule the task. Second, in an
actual implementation of the model, each processor P_i only needs to maintain the head of VL_i,
because this is the place from which processor P_i schedules tasks. (Chapter 4 will discuss this
issue in greater detail.) Since tasks in the tail of VL_i are not interesting to P_i, we may eliminate
the communication that would be required to maintain the tail of VL_i.
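
The relation between physical and virtual lists can be sketched in Python as follows. The sketch makes several assumptions: the class `P2ListScheduler` is hypothetical, all lists live in one address space (so no messages are exchanged), processors are numbered from 0, and each physical list is an unsorted dictionary. It is not the implementation described in Chapter 4.

```python
class P2ListScheduler:
    """Toy p^2-list model: pl[owner][j] is the physical list PL_{owner,j}."""

    def __init__(self, p):
        self.p = p
        self.pl = [[dict() for _ in range(p)] for _ in range(p)]

    def insert(self, owner, task, priorities):
        # Only the p physical lists on the designated processor are touched.
        for j, priority in enumerate(priorities):
            if priority is not None:
                self.pl[owner][j][task] = priority

    def head_of_virtual_list(self, i):
        # VL_i is the merge of PL_{0,i}, ..., PL_{p-1,i}; only its head is needed.
        best = None
        for owner in range(self.p):
            plist = self.pl[owner][i]
            if plist:
                task = max(plist, key=plist.get)
                if best is None or plist[task] > best[0]:
                    best = (plist[task], owner, task)
        return best

    def schedule(self, i):
        best = self.head_of_virtual_list(i)
        if best is None:
            return None
        _, owner, task = best
        # The owner alone deletes the task from all of its physical lists.
        for plist in self.pl[owner]:
            plist.pop(task, None)
        return task


s = P2ListScheduler(2)
s.insert(0, "T1", (2, 7))
s.insert(1, "T2", (5, 1))
s.insert(0, "T3", (3, 3))
order = [s.schedule(1), s.schedule(0), s.schedule(0), s.schedule(0)]
```

Because both insertion and deletion are confined to the owner of a task, the owner is also the natural place to arbitrate between processors that request the same task.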
We can further improve the modified model in the following two situations. The first
situation is that if some PLs on the same processor are identical, we only need to use one PL
to represent these PLs. For example, in Figure 2.5, we can use one PL to stand for PL_{1,2} and
PL_{1,p}, if they are always the same.

The second situation is that if certain priorities of each task are monotonically related, then
the corresponding PLs can be reduced to a single PL, supplemented with a set of monotonic
functions for deriving the other PLs. For example, if the priority for each task in one PL
(denoted by PL') is a monotonic function $f$ of its priority in another PL (denoted by PL), then
the system can perform operations on PL' based on the data structure of PL and the priority
translation function $f$, as follows. (1) We only need to insert tasks into and delete tasks from
PL, not PL'. (2) When some processor tries to schedule a task from PL', the system can
schedule the task from the head of PL if the function $f$ is monotonically increasing, or from
the tail of PL if $f$ is monotonically decreasing. We call PL a base PL and call PL' a derived
PL.
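
The base/derived relation can be illustrated with a minimal sketch. It is written in Python only for brevity, and the names `BasePL` and `pop_derived` are hypothetical, not part of the interface described in this thesis. The sketch assumes that the head of a list holds the highest priority and that tasks with equal priorities are mutually comparable (e.g., strings).

```python
import bisect


class BasePL:
    """Base physical list, kept sorted by ascending priority (hypothetical name)."""

    def __init__(self):
        self._items = []  # (priority, task) pairs in ascending priority order

    def insert(self, priority, task):
        # Only the base PL is ever updated; the derived PL stores nothing.
        bisect.insort(self._items, (priority, task))

    def pop_head(self):
        # Head of the base PL: the task with the highest priority.
        return self._items.pop()[1]

    def pop_tail(self):
        # Tail of the base PL: the task with the lowest priority.
        return self._items.pop(0)[1]


def pop_derived(base, increasing):
    """Schedule from the derived PL without materializing it.

    `increasing` states whether the translation function f is monotonically
    increasing (take the base head) or decreasing (take the base tail).
    """
    return base.pop_head() if increasing else base.pop_tail()
```

Because $f$ is monotonic, the order of the derived PL is either the same as or the exact reverse of the base order, so $f$ never has to be evaluated in order to pick the next task.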


2.3  Multilist  Scheduling  Model 


Now, we can modify the p'-list model by allowing the programmer to use one list to represent
other list(s), and formally define the multilist scheduling model as follows.


• On each processor $P_i$, the programmer creates some PLs, say $k_i$ PLs, denoted by $PL_{i,j}$,
$1 \le j \le k_i$.

• For each $VL_i$, the programmer designates certain PLs (in the system) which are merged
into $VL_i$. The processor $P_i$ will automatically schedule a task from the head of $VL_i$,
which is constructed from these designated PLs. If the scheduled task is from processor
$P_{i'}$, then $P_{i'}$ will subsequently delete the task from $PL_{i',1}, \ldots, PL_{i',k_{i'}}$. Implicitly, some
VLs are also changed accordingly.

• At runtime, for each task assumed to be stored on some processor $P_i$ (which can be
designated by the programmer), the programmer assigns to the task $k_i$ priorities, denoted
by $\pi_1, \pi_2, \ldots, \pi_{k_i}$. The system will insert the task into each $PL_{i,j}$ according to $\pi_j$.
Implicitly, some VLs are also changed accordingly. Note that priorities of a task can be
changed dynamically.

Figure 2.6: Multilist scheduling model.
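
The three operations of the definition can be mimicked by a small, centralized sketch. It is an illustration under simplifying assumptions, not the implementation discussed in Chapter 4: all lists live in one address space, no interprocessor communication is modeled, task names are unique, and the class and method names (`MultilistModel`, `create_pl`, `merge_into_vl`, `insert`, `schedule`) are hypothetical.

```python
class MultilistModel:
    """Centralized toy version of the multilist scheduling model."""

    def __init__(self):
        self.pls = {}      # (processor, j) -> {task: priority}
        self.pattern = {}  # processor -> keys of the PLs merged into its VL

    def create_pl(self, proc, j):
        # Operation 1: create the j-th PL on processor `proc`.
        self.pls[(proc, j)] = {}

    def merge_into_vl(self, proc, pl_key):
        # Operation 2: designate a PL (anywhere in the system) for VL_proc.
        self.pattern.setdefault(proc, []).append(pl_key)

    def insert(self, home, task, priorities):
        # Operation 3: priorities[j] orders the task within the j-th PL on `home`.
        for j, prio in enumerate(priorities):
            self.pls[(home, j)][task] = prio

    def schedule(self, proc):
        # The head of VL_proc is the highest-priority task over all merged PLs.
        best = None
        for key in self.pattern.get(proc, []):
            for task, prio in self.pls[key].items():
                if best is None or prio > best[0]:
                    best = (prio, task, key[0])
        if best is None:
            return None
        _, task, home = best
        # The home processor deletes the task from every one of its PLs.
        for (p, _j), pl in self.pls.items():
            if p == home:
                pl.pop(task, None)
        return task
```

A real implementation would keep each PL ordered and maintain only the head of each VL; the linear scan here is chosen purely to keep the sketch short.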

The programmer uses the first two operations to specify a "scheduling pattern" by which the
system creates PLs and merges PLs into VLs. Figure 2.6 is an example of a scheduling pattern.
In the p-list and p'-list models, we have mentioned that the priorities of a task are allowed to
be changed at runtime. In the multilist scheduling model, the scheduling pattern can also be
changed at runtime. Although scheduling patterns are usually fixed, we will present a good
example of a scheduling algorithm (in Section 3.1.2.2) whose scheduling pattern is changed at
runtime.

In addition, if a PL can be derived from another PL with a monotonic priority translation
function $f$, the programmer only needs to specify this derivation relation for the derived PL and
does not need to assign redundant priorities. The system will only insert tasks into and delete
tasks from the base PL, not the derived PL. In order to schedule a task from the derived PL, the
system schedules the task from the head or the tail of the base PL, depending on whether $f$ is
monotonically increasing or decreasing. Since the function $f$ is a user-defined runtime function
and it can also depend on some runtime information, this potentially provides a simple way to
dynamically change priorities, though we have not found an appropriate example to present in
this thesis.

The  above  definition  captures  the  essence  of  the  multilist  scheduling  model.  Chapter  4 
will  show  that  the  addition  of  parameters  for  tasks,  PLs,  and  VLs  can  further  improve  the 
performance  of  the  system.  For  example,  the  programmer  can  give  an  estimated  grain  size  for 
each  task  so  that  our  system  can  use  it  to  balance  the  load.  Appendix  A  presents  the  interface 
for  the  model. 

After the programmers specify the above information, they can assume that the system will
schedule tasks in an order roughly corresponding to the specified priorities. However, this
order may vary slightly for heuristic reasons. So, programmers should not rely on priorities
for program correctness.


2.4  Discussion 

If  a  programmer  wants  to  implement  a  scheduling  algorithm  without  the  help  of  our  model,  he 
or  she  needs  to  do  the  following  work. 


1. Set up the scheduling pattern for the scheduling algorithm.

2.  Assign  priorities  (and  some  other  parameters)  to  each  task. 

3.  Implement  each  PL  in  the  pattern. 

4.  Implement  the  details  of  scheduling  tasks  from  VLs  according  to  the  pattern. 


The  multilist  scheduling  model  successfully  decomposes  the  above  work  into  two  parts: 
the  first  two  items  and  the  last  two  items.  Programmers  only  need  to  specify  the  scheduling 
policy  in  the  first  part,  while  the  second  part  (the  details  of  maintaining  PLs  and  VLs)  can 
be  hidden  from  programmers.  In  fact,  maintaining  PLs  and  VLs  is  the  most  difficult  part  of 
writing  a  scheduling  algorithm.  Maintaining  VLs  is  especially  complicated  due  to  the  need  for 
interprocessor communication.

The decomposition of the task scheduling process allows the programmers to focus on
designing efficient scheduling algorithms, while allowing the system designers to focus on
designing efficient supportive routines. In the former case, programmers are encouraged to
design more interesting and efficient scheduling algorithms, which are hard to implement
without the help of our system. Chapter 3 will illustrate several scheduling algorithms based
on this model. In the latter case, the system designers can continue to develop better methods
to optimize the performance of the supportive routines. Chapter 4 will propose some efficient
techniques to maintain VLs and PLs for the current implementation.

In  addition,  we  argue  that  it  is  easy  for  (scheduling)  programmers  to  specify  scheduling 
policies  (items  1  and  2)  based  on  the  multilist  scheduling  system.  When  programmers  come 
up  with  scheduling  algorithms,  the  programmers  usually  can  easily  figure  out  the  scheduling 
sequences  of  tasks  (actually  they  are  just  the  task  sequences  in  VLs).  Therefore,  we  argue  that 
they  should  be  able  to  specify  the  scheduling  patterns  and  to  assign  priorities  corresponding 
to  the  scheduling  sequences.  We  illustrate  the  simplicity  of  implementing  many  scheduling 
algorithms  based  on  this  model  in  the  next  chapter.  At  least,  it  is  less  painful  to  implement 
scheduling algorithms based on our model than to write task scheduling routines from scratch.


Chapter 3


Examples  of  Scheduling  Algorithms 


This chapter will develop multilist scheduling schemes for several scheduling algorithms to
demonstrate how easily the model can be utilized and how widely the model can be applied to
applications. Section 3.1 will show several main examples of scheduling algorithms, each using
one distinct scheduling pattern (defined in Section 2.3). Section 3.2 will show other examples
of scheduling algorithms whose scheduling patterns are the same as the ones in Section 3.1.
Finally, Section 3.3 will give some more discussion.


3.1  Main  Examples 


This section will show several main examples of scheduling algorithms, each using one distinct
scheduling pattern. Section 3.1.1 will propose two multilist scheduling schemes to implement
two different scheduling algorithms for parallel best-first search. In both of the schemes,
each processor needs to create one physical list (PL). Section 3.1.2 will propose two multilist
scheduling schemes to implement two different scheduling algorithms for parallel
divide-and-conquer. In both of the schemes, each processor needs to create two PLs. Section 3.1.3
will propose a multilist scheduling scheme to implement a scheduling algorithm for parallel
synchronous network simulation problems. In this scheme, each processor needs to create $p$
PLs, if there are $p$ processors.


3.1.1 Parallel Best-First Search


Best-first search (BFS) is a common computation paradigm, in which the program always
schedules the best task among all possible tasks for execution. Examples of BFS computations
include some branch-and-bound (B&B) problems such as the traveling salesman problem (TSP)
[29], and some state-space search problems such as the 15-puzzle [75].

A  BFS  computation  can  be  viewed  as  a  process  of  expanding  a  tree.  Each  node  in  the  tree 
corresponds  to  a  problem  instance,  and  children  of  the  node  correspond  to  its  subproblems. 
Each  node  has  an  associated  cost  of  its  corresponding  problem  instance.  The  computation 
always  chooses  the  node  with  the  least  cost  for  execution.  A  node  is  called  a  solution  node  if 
it represents a solution. A BFS computation tries to find, among all the solution nodes, the one
with the least cost.

Most BFS algorithms maintain an invariant property, called admissibility [75], in which the
cost of each node is less than or equal to the costs of its children. In these algorithms, a BFS
computation terminates when it has found some solution nodes, the least cost among which
is $C_{min}$, and it has expanded all nodes with costs smaller than $C_{min}$. Since, by admissibility,
no descendant of a remaining node can have a smaller cost than that node, we will not be able
to find a solution node with a cost smaller than $C_{min}$. Thus, the node with cost $C_{min}$ is the
result of this computation.
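
The termination rule can be made concrete with a short sequential sketch. It assumes an admissible cost function and a finite tree; the function name `best_first_search` and its callback parameters (`cost`, `children`, `is_solution`) are hypothetical choices for this illustration.

```python
import heapq


def best_first_search(root, cost, children, is_solution):
    """Return (C_min, node) for the cheapest solution node, or None."""
    # Heap entries are (cost, tie-breaker, node) so nodes are never compared.
    frontier = [(cost(root), 0, root)]
    counter = 1
    best = None  # cheapest solution found so far, as (cost, node)
    # Stop once every frontier node costs at least C_min: by admissibility,
    # none of their descendants can then be a cheaper solution.
    while frontier and (best is None or frontier[0][0] < best[0]):
        c, _, node = heapq.heappop(frontier)
        if is_solution(node):
            best = (c, node)  # the loop condition guarantees c < current C_min
            continue
        for child in children(node):
            heapq.heappush(frontier, (cost(child), counter, child))
            counter += 1
    return best
```

The loop condition is exactly the rule stated above: the search ends when a solution with cost $C_{min}$ is known and no unexpanded node has a smaller cost.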


3.1.1.1  Scheduling  Algorithm  Based  on  a  Global  Priority  Queue 

In  order  to  solve  BFS  efficiently  in  parallel,  we  can  use  a  simple  scheduling  algorithm  requiring 
a  global  priority  queue  (GPQ),  as  shown  in  Figure  3.1(a).  This  scheduling  algorithm  will  be 
called  PBFS-GPQ  (Parallel  BFS  with  a  GPQ)  in  this  thesis.  In  this  algorithm,  each  new  node 
corresponding  to  a  task  is  inserted  into  the  GPQ  in  accordance  with  the  cost  associated  with  the 
node;  each  processor  schedules  the  node  with  the  least  cost  in  the  GPQ.  Zhang  [108]  proved 
that  the  algorithm  can  balance  the  load  very  well  (without  considering  the  communication 
overhead). 

Figure 3.1: Scheduling algorithm for PBFS-GPQ: (a) using one GPQ and (b) using our model.

The PBFS-GPQ scheduling algorithm described above can be easily implemented in the
multilist scheduling model. The algorithm, whose scheduling pattern is shown in Figure 3.1(b),
is described as follows:


• On each processor, create one PL.

• For each node (corresponding to a task) with cost $C$, assign a priority, $\pi = -C$. Note
that minimum cost translates to highest priority.

• For each processor $P_i$, merge all the PLs (from all the processors) into $VL_i$. Thus, each
virtual list (VL) is actually identical to the GPQ.


Since  each  processor  schedules  a  task  from  its  VL  which  is  identical  to  the  GPQ,  the  above 
multilist  scheduling  scheme  realizes  the  PBFS-GPQ  scheduling  algorithm.  Note  that  this  is  a 
basic  technique  to  form  a  GPQ. 
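
As a rough illustration of how the merged PLs behave like one GPQ, consider the following sketch. The class name `PBFSGPQ` and its methods are hypothetical; the sketch is centralized and ignores communication, which Chapter 4 addresses for the actual system.

```python
import heapq


class PBFSGPQ:
    """One PL per processor; every VL merges all PLs (toy, centralized)."""

    def __init__(self, p):
        # Python heaps are min-heaps, so each PL is keyed on -priority.
        self.pl = [[] for _ in range(p)]

    def create_node(self, proc, node, cost):
        priority = -cost  # minimum cost translates to highest priority
        heapq.heappush(self.pl[proc], (-priority, node))

    def schedule(self, proc):
        # All VLs are identical, so the result does not depend on `proc`.
        nonempty = [pl for pl in self.pl if pl]
        if not nonempty:
            return None
        best = min(nonempty, key=lambda pl: pl[0])  # PL whose head is best
        return heapq.heappop(best)[1]
```

Only the heads of the PLs are inspected when scheduling, which mirrors the earlier observation that a processor only needs the head of its VL.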

The scheduling pattern for PBFS-GPQ will recur throughout this chapter. We therefore find
it useful to define a "global scheduling subpattern", as follows.


Definition 3.1 A subpattern of a scheduling pattern is called a "global scheduling
subpattern" if it contains at least one PL on each processor, and all its PLs are
merged into all VLs.


As  for  the  interface  to  the  application  layer,  application  programmers  only  need  to  do  the 
following  two  things: 


•  Declare  initially  that  the  PBFS-GPQ  scheduling  algorithm  will  be  used.  (This  establishes 
the  appropriate  scheduling  pattern  in  the  scheduling  layer.) 

•  Declare  the  cost  of  a  node  whenever  one  is  created.  (This  allows  the  node’s  priority  to 
be calculated in the scheduling layer.)


3.1.1.2 Scheduling Algorithm with Randomization

For parallel best-first search, Karp and Zhang [57] proposed a scheduling algorithm with the
technique of randomization. This scheduling algorithm will be called PBFS-R (Parallel BFS
with Randomization) in this thesis. In this algorithm, each processor has one local priority
queue. Whenever a node is created, we randomly select a destination processor and then store
the task into the local priority queue of that processor, according to the cost of the node. Each
processor always schedules tasks for execution from its own local priority queue. Karp and
Zhang also proved that this algorithm can balance the load well with a very high probability.

The  PBFS-R  scheduling  algorithm  described  above  can  also  be  easily  implemented  in  the 
multilist  scheduling  model,  as  follows: 


•  On  each  processor,  create  one  PL. 

• When a node (corresponding to a task) with cost $C$ is created, designate a processor at
random to store the task, and then assign a priority, $\pi = -C$, to the task.

• For each processor $P_i$, only merge its local $PL_i$ into $VL_i$. Thus, each $VL_i$ is actually
identical to $PL_i$. So, we perform communication only when a task is created, not when
a task is scheduled.


Hence, the multilist scheduling scheme above realizes Karp and Zhang's scheduling
algorithm. This is a good example of an algorithm in which a task is stored in the PLs on a
different processor from the one creating the task.
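
The scheme above can be sketched as follows. The class name `PBFSR`, the `seed` parameter, and the use of Python's `random.Random` as the source of randomness are assumptions made for this illustration only.

```python
import heapq
import random


class PBFSR:
    """Each VL_i equals the local PL_i; placement is random (toy sketch)."""

    def __init__(self, p, seed=None):
        self.pl = [[] for _ in range(p)]  # one local priority queue each
        self.rng = random.Random(seed)

    def create_node(self, node, cost):
        # Communication happens here: the task is stored on a random processor.
        dest = self.rng.randrange(len(self.pl))
        # Priority is -cost; a min-heap keyed on cost gives the same order.
        heapq.heappush(self.pl[dest], (cost, node))
        return dest

    def schedule(self, proc):
        # Scheduling is purely local and needs no communication.
        return heapq.heappop(self.pl[proc])[1] if self.pl[proc] else None
```

In contrast to PBFS-GPQ, `schedule` touches only the caller's own PL, which is the point of the randomized placement.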


3.1.2  Parallel  Divide-and-Conquer 

Divide-and-conquer (D&C) is another common computation paradigm, in which the solution of
a problem is obtained by solving its subproblems recursively. Examples of D&C computations
include various sorting methods such as quicksort [43], computational geometry procedures
such as convex hull calculation [79], AI search heuristics such as constraint satisfaction
techniques [40], adaptive data classification procedures such as generation and maintenance of
quadtrees [86], and numerical methods such as multigrid algorithms [72] for solving partial
differential equations.

A  D&C  computation  can  be  viewed  as  a  process  of  expanding  and  shrinking  a  tree.  Each 
node  in  the  tree  corresponds  to  a  problem  instance;  children  of  the  node  correspond  to  its 
subproblems.  During  the  computation,  each  internal  (non-leaf)  node  goes  through  two  phases. 
The  first  phase  is  the  divide  phase  during  which  the  problem  instance  associated  with  the  node 
is  divided  into  subproblems.  The  second  phase  is  the  combine  phase  during  which  the  solution 
of  the  problem  instance  associated  with  the  node  is  derived  by  combining  solutions  of  the 
subproblems  associated  with  the  node’s  children.  Each  leaf  after  its  creation  will  perform  some 
computation  and  return  the  results  to  its  parent.  At  a  given  time,  nodes  on  a  wavefront  that 
cuts  across  all  paths  from  the  root  to  leaves  can  be  active  in  performing  divide,  combine,  or 
compute  operations.  Along  each  path  the  wavefront  first  moves  down  from  the  root  to  its  leaf 
and  then  up  from  the  leaf  to  the  root.  For  simplicity  of  discussion,  we  will  ignore  the  shrinking 
phase. It will be useful to define a frontier node as a node which has been generated but has
not been expanded, and a local frontier node of a processor as a frontier node whose parent was
expanded (or executed) on this processor.


3.1.2.1 Wu-Kung's Scheduling Algorithm

In  order  to  perform  D&C  efficiently  in  parallel,  we  [105]  designed  an  efficient  scheduling 
algorithm,  called  PDC-WK  (Parallel  D&C  with  Wu  and  Kung’s  method)  in  this  thesis.  The 
PDC-WK  scheduling  algorithm  schedules  nodes  according  to  the  following  rules.  First,  if 
a  processor  has  local  frontier  nodes,  it  must  schedule  the  deepest  among  them.  This  is  the 
depth-first  search  which  can  minimize  the  local  memory  requirement  and  avoid  wasteful 
interprocessor  communications.  Second,  when  a  processor  runs  out  of  local  frontier  nodes, 
it  follows  breadth-first  search  to  schedule  a  frontier  node  closest  to  the  root,  from  all  (other) 
processors.  Note  that  the  node  closest  to  the  root  is  likely  to  contain  the  largest  subtrees,  which 
will  have  the  most  locality  and  therefore  will  need  the  least  communication. 

We  also  proved  that,  among  all  the  scheduling  algorithms  which  can  split  the  load  nearly 
evenly,  our  algorithm  is  optimal  with  respect  to  the  communication  cost,  which  is  defined  to 
be  equal  to  the  total  number  of  cross  nodes.  (A  cross  node  is  a  node  which  is  generated  by 
one processor but expanded by another processor.) A more detailed description of the above
results will be presented in Section 5.1.

Figure 3.2: Scheduling pattern for PDC-WK.

The  PDC-WK  scheduling  algorithm  described  above  can  be  easily  implemented  in  the 
multilist  scheduling  model.  The  algorithm,  whose  scheduling  pattern  is  shown  in  Figure  3.2, 
is  described  as  follows. 


• Create two PLs, local list (LL) and global list (GL), on each processor.

• For each processor $P_i$, merge all the GLs (from all the processors) and its own LL into
$VL_i$. According to Definition 3.1, this scheduling pattern also includes a global scheduling
subpattern like PBFS-GPQ, but each processor may also schedule nodes from its own
LL.

• For each node (corresponding to a task), assign to it two priorities: local priority $\pi_L = l$
(corresponding to LL) and global priority $\pi_G = -l$ (corresponding to GL), where $l$ is the
level of the node in the tree. A node is said to be at tree level $l$ if it is the $l$th node on the
path from the root to the node. The root is at tree level 0.

The above multilist scheduling scheme satisfies the scheduling rules of PDC-WK, as we
shall now show. Since $\pi_L = l$, the processor always follows depth-first search to schedule
a local node, if one exists; otherwise, because $\pi_G = -l$, the processor always schedules the
node closest to the root among all processors. Thus, this scheme realizes the PDC-WK scheduling
algorithm.
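
The argument can be checked on a toy model in which each $VL_i$ is formed from the processor's own LL entries (priority $l$) and every processor's GL entries (priority $-l$). The class name `PDCWK` and its methods are hypothetical, node names are assumed unique, and the sketch is centralized.

```python
class PDCWK:
    """Toy PDC-WK pattern: VL_i = own LL merged with all GLs."""

    def __init__(self, p):
        self.nodes = [dict() for _ in range(p)]  # per processor: node -> level

    def create_node(self, proc, node, level):
        # One entry stands for the node's presence in both LL and GL.
        self.nodes[proc][node] = level

    def schedule(self, proc):
        # LL entries of this processor carry the local priority +level.
        cands = [(lvl, n, proc) for n, lvl in self.nodes[proc].items()]
        # GL entries of every processor carry the global priority -level.
        cands += [(-lvl, n, q)
                  for q, d in enumerate(self.nodes)
                  for n, lvl in d.items()]
        if not cands:
            return None
        _, node, owner = max(cands)  # head of the virtual list
        del self.nodes[owner][node]  # owner removes it from both LL and GL
        return node
```

Local priorities are never negative and global priorities are never positive, so a processor with local frontier nodes takes the deepest one, and an idle processor takes the node closest to the root.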

Since $\pi_L = -\pi_G$ in the above scheduling scheme, we can further improve it by letting LL
be a derived PL based on GL with the priority translation function $f(\pi) = -\pi$. This is a good
example showing that derived PLs can be used to optimize the performance.

As  for  the  interface  to  the  application  layer,  application  programmers  need  to  do  the 
following  two  things. 


• Declare initially that the PDC-WK scheduling algorithm will be used. (This establishes
the appropriate scheduling pattern in the scheduling layer.)

• Declare the tree level $l$ of a node whenever a new node is created. (This allows the node's
priorities $\pi_L$ and $\pi_G$ to be calculated from $l$ in the scheduling layer.)


3.1.2.2 Scheduling Algorithm Based on the Round-Robin Strategy

For parallel D&C, some researchers [34, 44, 82] have used another scheduling algorithm
based on a round-robin strategy. This scheduling algorithm, called PDC-RR (Parallel D&C
with the Round-Robin strategy) in this thesis, is the same as the PDC-WK scheduling algorithm
except for the following: in PDC-RR an idle processor will try to schedule the node closest to
the root among those on its pre-selected processor (not in the whole system as in PDC-WK), which
is dynamically changed in a round-robin fashion as follows. Each processor has a variable
$s$, representing the ID of the pre-selected processor. When the processor is idle, it requests a
frontier node from processor $P_s$ (as above) and lets the next $s$ be $(s \bmod p) + 1$ such that the
next node request will be sent to the next processor $P_{(s \bmod p)+1}$. One possible drawback of
PDC-RR is that, since the node scheduled by the idle processor is closest to the root only on $P_s$ (not
globally), some other processors may have nodes much closer to the root, i.e., it is very likely
that the scheduled node does not have the largest subtree.

Since the difference between the PDC-RR and PDC-WK scheduling algorithms is in the set
of processors from which an idle processor will schedule a node, we can implement PDC-RR by
modifying PDC-WK as follows: For each processor $P_i$, merge the GL on processor $P_{s_i}$ and the
LL on $P_i$ into $VL_i$, where $s_i$ is a variable on $P_i$. After processor $P_{s_i}$ is requested, the variable
$s_i$ is changed to $(s_i \bmod p) + 1$. Thus, the above scheduling policy based on the multilist
scheduling model realizes the PDC-RR scheduling algorithm. This scheduling algorithm is a
good example showing that the scheduling pattern may be changed dynamically.
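
A small sketch of the dynamically changing pattern follows. Processor IDs are 1-based to match the update rule $s_i \leftarrow (s_i \bmod p) + 1$; the class name `PDCRR` and the initial value chosen for each $s_i$ are assumptions of this illustration, since the text does not specify them.

```python
class PDCRR:
    """Toy PDC-RR: VL_i merges the local LL and the GL on P_{s_i}."""

    def __init__(self, p):
        self.p = p
        self.nodes = {i: {} for i in range(1, p + 1)}  # node -> tree level
        # Assumed initial pre-selected processor: the next one in ID order.
        self.s = {i: (i % p) + 1 for i in range(1, p + 1)}

    def create_node(self, proc, node, level):
        self.nodes[proc][node] = level

    def schedule(self, proc):
        local = self.nodes[proc]
        if local:
            node = max(local, key=local.get)  # deepest local frontier node
            del local[node]
            return node
        target = self.s[proc]
        self.s[proc] = (target % self.p) + 1  # the pattern changes at runtime
        remote = self.nodes[target]
        if not remote:
            return None  # the request to P_target found no frontier node
        node = min(remote, key=remote.get)  # closest to the root on P_target
        del remote[node]
        return node
```

Note that the update rule eventually points $s_i$ at $P_i$ itself; in this sketch such a request simply finds nothing, because the processor is idle.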


3.1.3 Parallel Synchronous Network Simulation

Section  2.2  defined  the  Synchronous  Network  Simulation  (SNS)  problem  by  using  a  graph  to 
represent  it.  In  that  section,  we  also  mentioned  a  dynamic  scheduling  strategy  based  on  the 
criterion  of  keeping  the  total  number  of  cross  edges  (also  see  the  definition  in  that  section)  as 
small  as  possible.  In  this  section,  we  will  first  describe  the  scheduling  algorithm  based  on  the 
dynamic  scheduling  strategy  and  then  propose  a  corresponding  multilist  scheduling  scheme. 

In  this  scheduling  algorithm,  we  partition  the  graph  of  nodes  (each  corresponding  to  a 
thread)  over  the  processors  at  the  beginning  of  the  computation.  Then,  during  each  phase,  we 
dynamically  balance  the  load  as  follows: 


1. If a processor has some local nodes (residing on the processor) that have not been executed
yet, it always schedules the local nodes which are connected to no cross edges; then, it
schedules the local nodes which are connected to some cross edges.

2. When a processor has executed all local nodes, it schedules a node from another processor
such that the total number of cross edges increases the least.


The  scheduling  algorithm  described  above  can  be  easily  implemented  in  the  multilist  model. 
The  algorithm,  whose  scheduling  pattern  is  the  same  as  the  one  in  Figure  2.5,  for  a  p-processor 
system  is  described  as  follows: 


• Create $p$ PLs on each processor. Let $PL_{i,j}$ denote the $j$th PL on processor $P_i$.

• For each processor $P_i$, merge all the $i$th PLs (i.e., $PL_{j,i}$ for all $j$) into $VL_i$.

• Assign $p$ priorities to a node on $P_i$ as follows:

- The $j$th priority of a node $u$, $j \ne i$, is $\pi_j = E_{u,j} - E_{u,i}$, where $E_{u,k}$ is the number
of edges (including in-edges and out-edges) between $u$ and all the nodes on $P_k$.

- If a node $u$ is connected to some cross edges, its $i$th priority $\pi_i$ is set to $E_{max} + 1$,
where $E_{max}$ is the maximum number of edges which each node can have; otherwise,
the priority is set to an even higher priority, say $E_{max} + 2$.

One  can  verify  that  the  above  multilist  scheduling  scheme  satisfies  the  two  scheduling  rules 
for  parallel  SNS. 

We note that this scheduling algorithm needs to change priorities at run time. When a node
is moved to another processor, some $E_{u,i}$ values may be changed and therefore the priorities of
the node's neighbors will be changed accordingly. This is a good example showing the need to
change priorities dynamically.
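
The priority assignment for parallel SNS can be written out as a short function. This is a sketch under assumptions: processors are numbered from 0 (rather than from 1 as in the text), the graph is given as a list of edges plus a node-to-processor map, and the names `sns_priorities`, `edges`, `owner`, and `e_max` are hypothetical.

```python
def sns_priorities(u, edges, owner, p, e_max):
    """Return the p priorities of node u, indexed by processor number."""
    i = owner[u]
    e = [0] * p  # e[k] = E_{u,k}: edges between u and the nodes on P_k
    for a, b in edges:
        if a == u:
            e[owner[b]] += 1
        elif b == u:
            e[owner[a]] += 1
    # Remote priorities: pi_j = E_{u,j} - E_{u,i}, the reduction in cross
    # edges if u were moved from P_i to P_j.
    prios = [e[j] - e[i] for j in range(p)]
    # Local priority: nodes without cross edges are scheduled first.
    has_cross = any(e[k] for k in range(p) if k != i)
    prios[i] = e_max + 1 if has_cross else e_max + 2
    return prios
```

Recomputing this function for the neighbors of a node that has just moved is one simple way to realize the dynamic priority changes mentioned above.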

As  for  the  interface  to  the  application  layer,  application  programmers  only  need  to  do  the 
following: 

•  Initialize  the  scheduling  algorithm  for  parallel  SNS  by  describing  its  graph  of  nodes, 
corresponding  to  threads  or  tasks.  (This  establishes  the  appropriate  scheduling  pattern, 
and  allows  the  priorities  of  the  nodes  to  be  calculated  from  their  environment  in  the  graph, 
in  the  scheduling  layer.) 


3.2  Other  Examples  of  Scheduling  Algorithms 


This section will present some more examples of scheduling algorithms: the factoring algorithm
for parallel loops in Section 3.2.1, the principal variation splitting algorithm (slightly modified
from [21, 69]) for $\alpha$-$\beta$ search in Section 3.2.2, the scheduling algorithm for parallel quicksort
in Section 3.2.3, and the scheduling algorithm for parallel asynchronous network simulation in
Section 3.2.4. The scheduling patterns for these scheduling algorithms are the same as the ones
used in the previous section.


3.2.1 Parallel Loops with the Factoring Technique


Parallel  loops  (without  dependencies  between  their  iterations)  are  very  rich  resources  where 
we  can  exploit  parallelism  in  many  applications,  especially  for  scientific  applications.  Since 
the amount of computation in each iteration may not be fixed, an efficient algorithm needs to
balance  the  load  at  runtime  while  minimizing  the  amount  of  communication. 

Hummel et al. [48] proposed an efficient runtime technique, called factoring. Consider how
to parallelize a parallel loop with $m$ iterations on $p$ processors. Without loss of generality, we
assume $m$ to be $p(2^k - 1)$, where $k$ is an integer. In the factoring technique, iterations are first
grouped into tasks such that there are $p$ tasks each of which will execute $2^{k-1}$ iterations, $p$ tasks
each of which will execute $2^{k-2}$ iterations, ..., and $p$ tasks each of which will execute one
iteration. Then, each processor always schedules the task with the largest number of iterations
next because the fine grained tasks should be preserved for better load balancing near the end.
Since the total number of tasks is $pk$, we schedule tasks at most $pk$ times, which is a very small
number when compared with $m$.

We can implement this algorithm in our multilist model by simply using the scheduling
pattern of the PBFS-GPQ scheduling algorithm and letting each task with $2^i$ iterations have the
priority $i$.
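
The grouping used by factoring, together with the priority assignment just described, can be generated in a few lines. The function name `factoring_tasks` is hypothetical, and the sketch assumes the simplified case $m = p(2^k - 1)$ considered above.

```python
def factoring_tasks(p, k):
    """Return (priority, iterations) for all p*k tasks, largest tasks first.

    For each i = k-1, ..., 0 there are p tasks of 2**i iterations, and a
    task with 2**i iterations receives priority i.
    """
    return [(i, 2 ** i) for i in range(k - 1, -1, -1) for _ in range(p)]
```

For example, `factoring_tasks(3, 3)` yields nine tasks covering $3(2^3 - 1) = 21$ iterations, so only nine scheduling operations are needed.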

Nishikawa [70] has recently suggested the following modification to further reduce
communication in a distributed-memory system. Initially, we evenly distribute these tasks over
the processors such that each processor has one task with $2^{k-1}$ iterations, one task with $2^{k-2}$
iterations, ..., and one task with one iteration. Basically, each processor executes its own tasks
based on the same strategy: schedule the task with the largest number of iterations first.
However, when a processor (say $P_1$) "falls far behind" another processor (say $P_2$), $P_1$ will "off-load"
some task to $P_2$. To be more precise, let us consider the situation in which processor $P_1$ is the
slowest processor which has the task with $2^i$ iterations and processor $P_2$ is the fastest processor
which has the task with $2^j$ iterations, where $i > j$. The programmer can decide a threshold $t$
(a positive integer). Then, if and only if $i - j > t$, processor $P_2$ will schedule the task with $2^i$
iterations from $P_1$.

Nishikawa's scheduling algorithm can also be easily implemented in our multilist model by
using the scheduling pattern of PDC-WK. Initially, we partition tasks as above and then assign
priorities to each task with $2^i$ iterations as follows: the local priority $\pi_L = i$ and the global
priority $\pi_G = i - t$. For the situation given in the previous paragraph, the task with $2^i$ iterations
on processor $P_1$ has a global priority $i - t$ while the task with $2^j$ iterations on $P_2$ has a local
priority $j$. If $i - j > t$ (i.e., $i - t > j$), then $P_2$ will schedule the task with $2^i$ iterations from $P_1$.
Otherwise, no load balancing is required. Thus, this multilist scheduling scheme is the same
as Nishikawa's scheduling algorithm.

Since $\pi_L = \pi_G + t$, we can improve the scheduling scheme by letting LL be a derived PL
based on GL with the priority translation function $f(\pi) = \pi + t$. This is another example
(PDC-WK was the first example) showing that we can use derived PLs to optimize the performance.
The priority translation function here is monotonically increasing while that for the PDC-WK
scheduling algorithm is monotonically decreasing.
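
The off-loading condition $i - j > t$ falls out of comparing local and global priorities, as the following sketch shows. Tasks are represented by their exponents (a task with $2^e$ iterations is written `e`); the function name `pick_next` and the rule that a local task wins a priority tie are assumptions of this illustration, the latter chosen to match the "if and only if $i - j > t$" condition.

```python
def pick_next(own, others, t):
    """Choose the next task for a processor.

    own:    exponents of the tasks held by this processor
    others: exponents of the tasks held by other processors
    Returns ("local", e), ("remote", e), or None if nothing is left.
    """
    # Local priority is e; the middle field makes local tasks win ties.
    cands = [(e, 1, "local") for e in own]
    # Global priority is e - t for tasks stored on other processors.
    cands += [(e - t, 0, "remote") for e in others]
    if not cands:
        return None
    prio, is_local, where = max(cands)
    return (where, prio if is_local else prio + t)
```

With `t = 2`, a processor whose largest own task has exponent 1 takes a remote task with exponent 4, whereas a processor holding a task with exponent 2 keeps working locally.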


3.2.2 Parallel Alpha-Beta Search with Principal Variation Splitting Algorithm

Alpha-beta ($\alpha$-$\beta$) search [58] is a common computational paradigm for two-player game search
problems, e.g., chess [91] and Othello [64]. An $\alpha$-$\beta$ computation can also be viewed as a
process of expanding/shrinking a tree, as in D&C in Section 3.1.2, but with the following
properties.


•  Each  leaf  node  has  an  estimated  heuristic  value  and  returns  this  value  to  its  parent. 

•  Each  internal  node  in  the  divide  phase  expands  some  children  and  sorts  them  according 
to  how  likely  they  are  to  contain  an  optimum  heuristic  value  in  the  subtree  rooted  at  the 
child.  We  will  let  the  leftmost  child  be  the  most  promising  child. 

•  Each  internal  node  in  the  combine  phase  receives  all  the  values  returned  from  its  children 
and  then  returns  the  maximum  of  the  negatives  of  these  values. 

•  The  solution  of  a  game  tree  is  the  returned  value  of  the  root  and  the  identity  of  the  child 
who  has  the  negative  of  that  value. 

In order to solve $\alpha$-$\beta$ efficiently in parallel, some researchers [21, 69] have proposed the
principle variation splitting (PVS) algorithm. In the first stage of the PVS algorithm, we search the
path from the root to the leftmost leaf. This path is called the principle variation (PV) and
nodes on the PV are called PV nodes; in addition, a subtree is called a PV-subtree, as shown in
Figure 3.3, if its root is a child of a PV node but the root itself is not a PV node. Then, in the
next  stage,  PV-subtrees  rooted  at  the  deepest  tree  level  are  split  among  processors.  After  this 
stage  completes,  subtrees  rooted  at  the  second  deepest  tree  level  are  split  among  processors  in 
the  next  stage,  and  so  on.  These  researchers  used  the  so-called  tree  splitting  algorithm  in  [69] 
to  split  subtrees  among  processors.  But,  here,  we  will  simply  use  the  PDC-WK  scheduling 
algorithm (described in Section 3.1.2) to split subtrees among processors so that we balance the
load  of  nodes  in  these  subtrees  while  minimizing  the  communication.  Although  the  original 
PVS  algorithm  goes  through  the  stages  serially,  our  scheduling  algorithm  for  PVS  does  not  have 
to.  That  is,  if  there  are  no  available  nodes  for  the  current  stage,  idle  processors  can  schedule 
the  nodes  for  the  next  stage. 

Now,  we  want  to  design  the  PVS  algorithm  based  on  the  multilist  scheduling  model  as 
follows.  The  multilist  scheduling  scheme  for  PVS  uses  the  same  scheduling  pattern  as  PDC- 
WK.  Note  that  each  processor  has  two  physical  lists,  the  local  list  (LL)  and  the  global  list  (GL). 
(For simplicity of discussion, we omit the combine operation in $\alpha$-$\beta$ search, as we did for
D&C.)  We  assign  priorities  to  each  node  according  to  the  following  rules. 


• For each node on PV, its local priority is $\pi_L = \pi_{max}$ and its global priority is $\pi_G = \pi_{max}$,
where $\pi_{max}$ is a priority larger than any of the priorities assigned below.

• For each node (corresponding to a task) at tree level $i$ and in a PV-subtree rooted at tree
level $j$, let the local priority be $c \cdot j + i$ and the global priority be $c \cdot j - i$, where $c > 2h$
and $h$ is the tree height.

From  the  first  rule,  executing  the  nodes  in  PV  is  the  first  priority.  From  the  second  rule, 
since $c > 2h$ and $h > i$, $j$ represents the primary key and $i$ represents the secondary key.
Since  j  is  the  primary  key,  we  search  PV-subtrees  rooted  at  a  deeper  tree  level  earlier.  For  the 
PV-subtrees  rooted  at  the  same  tree  level,  the  value  j  is  the  same  and  therefore  the  scheduling 
algorithm  is  just  the  same  as  PDC-WK.  Thus,  the  above  scheduling  scheme  realizes  our  PVS 
algorithm. 
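
The second priority rule can be sketched in C as follows. This is a minimal illustration rather than the thesis code: the names `pvs_priority` and `pvs_node_priority` are hypothetical, and the choice $c = 2h + 1$ is simply one integer that satisfies $c > 2h$.

```c
#include <assert.h>

/* Hypothetical helper type and names; only the two formulas come from the text. */
typedef struct {
    long local;   /* c*j + i */
    long global;  /* c*j - i */
} pvs_priority;

/* Priorities of a node at tree level i inside a PV-subtree rooted at tree
 * level j, for a tree of height h. */
pvs_priority pvs_node_priority(long h, long i, long j)
{
    assert(h > 0 && i >= 0 && i <= h && j >= 0 && j <= h);
    long c = 2 * h + 1;   /* assumption: any integer c > 2h would do */
    pvs_priority p;
    p.local = c * j + i;
    p.global = c * j - i;
    return p;
}
```

With $h = 10$ (so $c = 21$ here), every priority in a PV-subtree rooted at level 4 is at least $21 \cdot 4 - 10 = 74$, while every priority in a PV-subtree rooted at level 3 is at most $21 \cdot 3 + 10 = 73$, which is why $j$ acts as the primary key.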

As  for  the  interface  to  the  application  layer,  application  programmers  only  need  to  do  the 
following  things. 

• Declare initially that the PVS scheduling algorithm will be used (this will establish the
scheduling pattern of PDC-WK in the scheduling layer); then, give the maximum tree
height $h$.

• For each node (corresponding to a task), declare the tree level $i$ of the node and the tree
level $j$ of the PV-subtree that includes the node. (This allows the node's priorities $\pi_L$ and
$\pi_G$ to be calculated from $i$, $j$ and $h$ in the scheduling layer.)


3.2.3 Parallel Quicksort Algorithm


Sorting is one of the most common operations in data processing. Given an array of $n$ elements,
each containing a key (not ordered), the problem is to sort the elements in the array according to the
key values.

Quicksort [43] is a fast sorting algorithm, based on the divide-and-conquer technique,
with an average computation time of $O(n \log n)$.¹ This algorithm is described as follows:


• Pick the first element (or pick one at random). Let $k$ be the key value of this element.

• Partition the array into two subarrays of elements such that the key values of elements in
one subarray are less than $k$ and the key values in the other subarray are greater than or
equal to $k$.

• Recursively sort each subarray.

¹If $g(n) = O(f(n))$, there exists some positive $c$ for which $g(n) < c f(n)$ for all sufficiently large $n$.


Since  the  quicksort  algorithm  is  based  on  the  D&C  technique,  we  can,  of  course,  apply 
the  PDC-WK  scheduling  algorithm  to  the  quicksort  algorithm.  However,  based  on  some 
characteristics of quicksort, we suggest a different priority assignment that may improve on
the PDC-WK scheduling algorithm.

In PDC-WK, since it is assumed that the shape of a tree (or subtree) cannot be predicted a
priori, we can only use the tree depth of a node to roughly estimate the average computation
amount (or locality) of a node and then use it to evaluate the global priority. However, for
the quicksort problem, the average time complexity of a node, corresponding to a sorting
subproblem with an array of $n'$ elements, is $O(n' \log n')$. Since a node with a larger value of
$n'$ will typically require more computation, an idle processor wants to schedule the node with
the largest $n'$ among all the other processors. So, we can use $n' - n$ to represent the global
priority of the node, instead. Note that the term $-n$ is added to the global priority to ensure
that the global priority is non-positive and, therefore, less than or equal to the local priority. As
for the local priority, since each processor may want to sort a small local array first in order to
preserve the large array tasks for later donation (note that donating a task with greater locality
may reduce the communication amount as mentioned earlier), we can let the local priority be
$n - n'$.

One  potential  problem  for  the  above  algorithm  is  that  the  value  of  n'  can  be  any  number 
between  0  and  n  and  therefore  there  could  be  too  many  distinct  priorities,  which  will  result  in 
more  task  scheduling  overhead  (this  will  be  discussed  in  greater  detail  in  the  next  chapter).  To 
avoid this problem, we can squash the range of priorities, by letting the local priority of a node
be² $\lceil \lg(n/n') \rceil$ and the global priority of the node be $\lceil \lg(n'/n) \rceil$. This reduces the number of
distinct priorities to at most $2 \lg n$, while roughly preserving their relative sequence.
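
As an illustration, the squashed priorities can be computed with integer arithmetic only. The sketch below is not taken from the thesis: the function names are hypothetical, it assumes $1 \le n' \le n \le 2^{62}$, and it uses the identity $\lceil \lg(n'/n) \rceil = -\lfloor \lg(n/n') \rfloor$ to avoid floating point.

```c
#include <assert.h>

/* ceil(lg(n / m)) for 1 <= m <= n: the smallest k with m * 2^k >= n. */
int ceil_lg_ratio(unsigned long long n, unsigned long long m)
{
    assert(m >= 1 && m <= n && n <= (1ULL << 62));
    int k = 0;
    while (m < n) {
        m *= 2;
        k++;
    }
    return k;
}

/* floor(lg(n / m)) for 1 <= m <= n: the largest k with m * 2^k <= n. */
int floor_lg_ratio(unsigned long long n, unsigned long long m)
{
    assert(m >= 1 && m <= n && n <= (1ULL << 62));
    int k = 0;
    while (m * 2 <= n) {
        m *= 2;
        k++;
    }
    return k;
}

/* Squashed local priority ceil(lg(n / n')); never negative. */
int qs_local_priority(unsigned long long n, unsigned long long n_sub)
{
    return ceil_lg_ratio(n, n_sub);
}

/* Squashed global priority ceil(lg(n' / n)) = -floor(lg(n / n')); never positive. */
int qs_global_priority(unsigned long long n, unsigned long long n_sub)
{
    return -floor_lg_ratio(n, n_sub);
}
```

For $n = 1024$, a subarray of 300 elements gets local priority 2 and global priority -1, and a single-element subarray gets 10 and -10, so at most about $2 \lg n$ distinct values occur.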


²$\lg x = \log_2 x$, and $\lceil x \rceil$ is the smallest integer larger than or equal to $x$.

3.2.4 Parallel Asynchronous Network Simulation


In  Section  3.1.3,  we  discussed  the  synchronous  network  simulation  problem.  However,  some 
network  simulation  problems  do  not  have  to  be  synchronized  at  the  end  of  each  phase.  In  those 
problems, as long as a node (thread) has received all of its incoming data via its in-edges, the node
can  advance  to  the  next  phase  and  continue  to  execute.  We  call  such  problems  asynchronous 
network  simulation  (ANS). 

For  parallel  ANS,  we  can  modify  the  scheduling  algorithm  for  parallel  SNS  (in  Section 
3.1.3) by redefining the $i$th priority $\pi_i$ of a task (corresponding to a process in some phase) as
follows: let the primary key be the negative of the phase index³ and the secondary key be the
original $\pi_i$. In accordance with the primary key, the new algorithm will try to schedule all the
tasks  in  the  current  phase  before  advancing  to  the  next  phase.  Thus,  the  scheduling  algorithm 
basically  is  the  same  as  the  scheduling  algorithm  for  parallel  SNS  except  that  some  processors 
may  be  able  to  start  executing  some  tasks  for  the  next  phase  while  waiting  for  tasks  (from  other 
processors)  for  the  current  phase. 

The above scheduling algorithm still has one problem: we often need to move tasks between
processors to balance the load near the end of each phase. We can reduce the communication
for load balancing by balancing the load only when one processor falls far behind another
processor. For example, suppose that the fastest processor, say $P_i$, has some tasks in a phase
$\phi_i$, while the slowest processor, say $P_j$, has no tasks in a newer phase than $\phi_j$, where $\phi_i > \phi_j$.
If $\phi_i - \phi_{thr} > \phi_j$, we say $P_j$ falls far behind $P_i$, and then $P_i$ can schedule some task from
processor $P_j$, where $\phi_{thr}$ is a positive threshold given by the programmer.

To  implement  the  above  modified  scheduling  algorithm,  we  only  need  to  change  some 
priorities as follows. For each task on each processor $P_i$, we add $\phi_{thr}$ to the primary key of
its $i$th priority. Since the $i$th priority of the task stands for the priority of scheduling the task
locally,  a  processor  tends  to  schedule  local  tasks  with  high  priorities  unless  another  processor 
falls  far  behind. 
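
The effect of the threshold on the primary key can be sketched as follows. This is an illustrative sketch with hypothetical function names; it models only the primary key (the negative of the phase index) and ignores the secondary key.

```c
#include <assert.h>

/* Primary key of a task when its owner considers it for local scheduling:
 * the negative phase index, raised by the threshold phi_thr. */
long ans_local_key(long phase, long phi_thr)
{
    assert(phi_thr > 0);
    return -phase + phi_thr;
}

/* Primary key of a task as seen by every other processor. */
long ans_remote_key(long phase)
{
    return -phase;
}

/* Nonzero when a processor whose best local task is in local_phase should
 * instead take a remote task that is in remote_phase. */
int ans_takes_remote(long local_phase, long remote_phase, long phi_thr)
{
    return ans_remote_key(remote_phase) > ans_local_key(local_phase, phi_thr);
}
```

With $\phi_{thr} = 2$, a processor working in phase 5 takes a remote task in phase 2 (since $5 - 2 > 2$) but keeps working locally when the remote task is in phase 3.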


³The index of a phase is the index of its previous phase plus one.

3.3 Discussion


The  multilist  scheduling  schemes  proposed  in  this  chapter  demonstrate  that  our  model  can  be 
widely  applied  to  many  scheduling  algorithms.  These  examples  also  demonstrate  that  it  is  much 
easier  to  implement  scheduling  algorithms  using  multilist  scheduling  than  to  write  sophisticated 
scheduling routines from scratch. For example, in our experiments, the code for the PDC-WK
and  PBFS-GPQ  scheduling  algorithms  (shown  in  Appendix  A.2)  only  has  about  10-20  lines. 
A  program  of  this  size  can  be  written  within  tens  of  minutes.  This  is  in  sharp  contrast  with 
previous  dynamic  load  balancing  programs,  which  would  typically  require  thousands  of  lines 
of  C  code.  This  was  the  case  in  our  earlier  experience  [31,  62]  in  parallelizing  Noodles,  a 
solid  modeling  program  [23].  It  took  us  months  to  write  the  load  balancing  part!  Since  our 
approach can greatly shorten the time of implementing a scheduling algorithm, we expect more
interesting  and  complicated  scheduling  algorithms  to  be  devised  and  implemented. 

Although  this  chapter  shows  the  simplicity  of  implementing  a  multilist  scheduling  scheme 
for  a  given  scheduling  algorithm,  we  offer  no  advice  on  how  to  come  up  with  the  scheduling 
algorithm itself. This is because there are too many factors which can affect system performance.
These factors include, for example, the relative importance of increasing parallelism,
minimizing the total amount of computation, minimizing the required amount of communication,
reducing the memory requirement, and reducing the number of distinct priorities. We
find it difficult to provide a simple rule for assigning priorities based on all the factors. So,
we  leave  this  problem  open.  We  only  want  to  argue  that  it  is  simple  to  implement  a  multilist 
scheduling  scheme  for  a  given  scheduling  algorithm,  and  (in  the  next  chapter)  that  our  general 
approach  incurs  no  significant  performance  overhead.  In  addition,  since  it  is  so  easy  to  vary 
a  scheduling  algorithm  in  our  model  by  simply  adjusting  priorities  or  scheduling  patterns,  it 
becomes  easy  for  a  programmer  to  find  the  best  scheduling  algorithm  by  searching  the  design 
space  empirically. 

Chapter 4


Implementation  Issues 


In  addition  to  the  issues  of  simplicity  and  generality  of  the  multilist  scheduling  model  (as 
discussed in the previous chapters), another important issue is to efficiently implement this
model such that our general approach incurs no significant performance overhead. Sections 4.1
and 4.2 will respectively propose some efficient techniques of implementing virtual lists (VLs)
and  physical  lists  (PLs),  on  which  the  model  is  based.  Section  4.3  will  show  that  our  general 
approach  incurs  no  significant  performance  overhead  at  least  for  certain  important  scheduling 
algorithms. 


4.1  Maintaining  Virtual  Lists 


Since  VLs  are  conceptually  constructed  from  PLs  which  may  be  on  different  processors,  we 
need  to  maintain  VLs  via  interprocessor  communication.  Section  4.1.1  will  first  describe  the 
standard  protocol,  which  we  can  use  to  merge  PLs  into  VLs  in  all  cases.  Section  4.1.2  will 
describe  a  more  efficient  protocol,  called  the  global  protocol,  which  we  can  use  when  the 
scheduling pattern includes a global scheduling subpattern (as defined in Definition 3.1).


4.1.1  Standard  Protocol 

Figure 4.1: An example of the standard protocol ($PL_1$ is merged into $VL_2$).

The standard protocol is applicable whenever there is a need to merge a PL into a VL over a
network. Figure 4.1 illustrates an example in which a PL, say $PL_1$ on processor $P_1$, is one
of the PLs merged into a VL, say $VL_2$ on another processor $P_2$. Let $\pi_1^{max}$ denote the highest
priority in $PL_1$. We can straightforwardly implement the standard protocol as follows.


• Whenever $\pi_1^{max}$ is changed, $P_1$ reports the new $\pi_1^{max}$ to $P_2$.

• $P_2$ requests a task from $PL_1$ if $\pi_1^{max}$ is higher than the priorities in any other PLs merged
into $VL_2$. Then, $P_1$ donates a task from $PL_1$ to $P_2$ (and removes the task from all other
PLs on $P_1$).
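
The requesting side of the two steps above can be sketched as follows, assuming the processor that owns the VL caches the most recent highest priority reported by each merged PL. The function name, the `has_task` flags, and the tie-breaking toward the lower index are assumptions made for this sketch; a larger number means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

#define NO_SOURCE (-1)

/* reported_max[k] is the last highest priority reported by the k-th PL merged
 * into this VL; has_task[k] is 0 when that PL last reported itself empty.
 * Returns the index of the PL to request a task from, or NO_SOURCE. */
int vl_pick_source(const long *reported_max, const int *has_task, size_t n)
{
    int best = NO_SOURCE;
    for (size_t k = 0; k < n; k++) {
        if (!has_task[k])
            continue;
        if (best == NO_SOURCE || reported_max[k] > reported_max[best])
            best = (int)k;
    }
    return best;
}
```

Because the choice relies on cached reports, it is only as fresh as the last report; this is the reason the protocol needs a report whenever a PL's highest priority changes.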


In  the  standard  protocol,  it  may  turn  out  that  a  PL  which  is  merged  into  a  VL  may  need  to 
report  to  the  processor  with  the  VL  too  frequently.  In  order  to  reduce  the  number  of  reports 
to  achieve  better  performance,  we  allow  the  programmer  to  provide  some  more  information  in 
the  following  two  ways. 


Figure 4.2: Omitting reports based on given priority ranges.

• The programmer can describe the known range of priorities in each PL. Consider the
scheduling pattern in Figure 4.2, in which the priority range of $PL_1$ is between $\pi_l$ and
$\pi_u$ ($> \pi_l$). When the priority of some task in $PL_2$ is greater than or equal to $\pi_u$, $VL_2$
can disable all reports from $PL_1$ because processor $P_2$ will not need to schedule a task
from $PL_1$. For example, in the PDC-RR scheduling algorithm, since each local list LL
contains only non-negative priorities, and each global list GL contains only non-positive
priorities, a processor does not need to report the maximum priority of its GL to any other
processors. A processor will request a task from a GL only when the processor becomes
idle.


Figure  4.3:  Indivisible  ranges  for  parallel  ANS  (no  report  when  the  phase  index  is  still  the 
same). 

• The programmer can define indivisible priority ranges for each PL, such that all the
priorities within one range are considered as one priority value by other processors.
Thus, the PL need not report when its highest priority changes within the same range. For example,
for  the  parallel  asynchronous  network  simulation  (ANS)  described  in  Section  3.2.4,  we 
need  to  balance  the  load  mainly  when  the  primary  priority,  the  phase  index,  is  changed. 
Since  it  is  not  important  to  let  the  other  processor  know  the  secondary  priority,  the 
programmer  can  declare  all  priorities  with  the  same  phase  index  to  lie  within  the  same 
priority  range,  so  that  PLs  will  avoid  making  reports  until  the  phase  changes,  as  shown 
in  Figure  4.3. 
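
Both report-suppression rules can be sketched in a few lines. The sketch assumes, purely for illustration, integer priorities and indivisible ranges of a fixed width $w$ of the form $[kw, (k+1)w)$; the thesis lets the programmer declare the ranges, so this encoding and all function names are hypothetical.

```c
#include <assert.h>

/* Index of the fixed-width range [k*width, (k+1)*width) that contains the
 * priority; floor division so that negative priorities are handled too. */
long range_index(long priority, long width)
{
    assert(width > 0);
    long q = priority / width;
    if (priority % width != 0 && priority < 0)
        q--;
    return q;
}

/* Indivisible ranges: a PL reports only when its highest priority moves to a
 * different range. */
int pl_must_report(long old_max, long new_max, long width)
{
    return range_index(old_max, width) != range_index(new_max, width);
}

/* Known priority range: a VL can disable reports from a PL whose priorities
 * never exceed pi_u once another merged PL holds a task with priority >= pi_u. */
int vl_can_disable_reports(long best_other_priority, long pi_u)
{
    return best_other_priority >= pi_u;
}
```

For parallel ANS, one range per phase index would play the role of the fixed-width range here, so that changes in the secondary key alone never trigger a report.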


4.1.2  Global  Protocol 

In some situations, we can achieve better performance by using a different protocol. In
particular, let us consider the case where the scheduling pattern includes a global scheduling
subpattern (defined in Definition 3.1). The case is important for many scheduling algorithms
(in Chapter 3), e.g., PBFS-GPQ, PDC-WK, the quicksort algorithm, the factoring scheduling
algorithm, and the principle variation splitting algorithm. For simplicity of discussion, we
consider that in the global scheduling subpattern each processor $P_i$ has one and only one PL,
denoted by $PL_i$, which is merged into each VL (see Figure 3.1(b)). Let $\pi_i^{max}$ denote the highest
priority in $PL_i$ and $\pi^{max}$ denote $\max(\pi_i^{max})$ for all $i$. For a scheduling pattern that includes
the global scheduling subpattern, the standard protocol may be inefficient in the following
situations:

• Whenever the value of $\pi_i^{max}$ is changed, processor $P_i$ will broadcast a message to all
VLs even if some other $\pi_j^{max}$ is already higher than $\pi_i^{max}$. The broadcast may result in
unnecessary communication overhead.

• Many processors may simultaneously send task requests to the processor containing the
task with $\pi^{max}$. This processor, which receives these requests, will “off-load” more than
one task. Among these off-loaded tasks, perhaps only the first task has a high priority.
For example, $PL_1$ has tasks with priorities (4, 6, 8, 9) and $PL_2$ has tasks with priorities
(1, 2, 10). If $P_3$ and $P_4$ need to request tasks at the same time, they both will send task
requests to $P_2$. Thus, either $P_3$ or $P_4$ will get the task with a low priority of 2.

In  order  to  improve  the  efficiency  when  there  is  a  global  scheduling  subpattern,  we  will 
introduce  a  central  mechanism  to  help  regulate  the  scheduling  protocol.  This  mechanism  is 
called a global load balancer (GLB), and this kind of scheduling protocol is called a global
protocol. 


In Section 4.1.2.1, we will first propose a simple global protocol to perform load balancing
by considering only those tasks with the highest priority $\pi^{max}$. If there are enough tasks with
priority $\pi^{max}$, we can ignore the overhead that task prioritization causes. However, if there are
not enough such tasks, task prioritization may result in significant overhead. We will describe
a solution to this problem in Section 4.1.2.2 and describe an advanced protocol based on the
solution in Section 4.1.2.3.


4.1.2.1 Simple Global Protocol

For a global protocol, the GLB basically wants to perform load balancing by considering only
those tasks with the highest priority $\pi^{max}$. Let $T_0$ be the set of tasks with priority $\pi^{max}$. (Later, we
will let $T_1$ be the set of tasks with priority $\pi^{max} - 1$.) The GLB needs to keep track of $\pi^{max}$ and $T_0$
in order to balance the load. So, if a processor sends a task request to the GLB, the GLB can
decide which task in $T_0$ to donate.

In the global protocol, the GLB needs to monitor each processor's status and keep track of
$\pi^{max}$. Each processor $P_i$ has the following status:


• $\pi_i^{max}$: The highest priority in $PL_i$.

• $\pi^{max}$: The highest priority in the entire system. It must be obtained from the GLB
because only the GLB can gather all $\pi_i^{max}$ together to determine the value. Since some
other PLs (e.g., the local list for PDC-WK) may also be merged into $VL_i$, $P_i$ needs to
compare $\pi^{max}$ with priorities in those PLs each time when scheduling tasks from its VL.
Therefore, it would be more efficient for each processor to have the value locally.

• $L_i^0$: The load of those tasks in $T_0$ and on $P_i$. The load of a given set of tasks $T$ is
$\sum_{\tau \in T} G_\tau$, where $G_\tau$ is the grain size of task $\tau$. Basically, the grain size of a task represents
the amount of computation for the task, given by the programmer. The grain size will
be defined more precisely in Section 4.1.2.3. The GLB can balance the load by donating
tasks from $T_0$ to an idle processor.

• $L_i^1$: The load of those tasks in $T_1$ and on $P_i$, where $T_1$ is the set of tasks with priority
$\pi^{max} - 1$. The GLB can balance the load by donating tasks in $T_1$ while $T_0$ is empty. The
purpose of having $T_1$ is to make the transition of load balancing from $T_0$ to $T_1$ smoother.
If we do not do this, the GLB cannot respond to the idle processor while $T_0$ is empty,
and therefore the idle processor will keep idling.


Now,  we  can  describe  a  simple  global  protocol  as  follows. 


1. The system balances the load round-by-round, where a “round” is defined as a period of
time in which $\pi^{max}$ remains the same.

2. Each processor $P_i$ reports $\pi_i^{max}$ and the changes of $L_i^0$ and $L_i^1$ to the GLB, in either of
the following two situations.

• $L_i^0$ or $L_i^1$ is changed “significantly” (e.g., by a factor of two), or

• $\pi^{max}$ is changed (i.e., a new round is started).

3. When $\pi^{max}$ is changed, the GLB broadcasts the new value of $\pi^{max}$ to each processor to
start a new round. Note that $\pi^{max}$ decreases only when no more tasks have priorities
$\pi \ge \pi^{max}$.

4. When a processor requests a task, the GLB tries to balance the load by donating a task in
$T_0$ to the processor. If $T_0$ is empty, the GLB will try to donate a task in $T_1$, instead.
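
Steps 2 and 4 can be sketched as follows. The sketch is an illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, loads are unsigned integers, a change between zero and non-zero load always counts as significant, and the GLB is assumed to pick the most heavily loaded processor as donor (the text leaves the donor choice to the GLB).

```c
#include <assert.h>

/* Step 2: report when the load has changed by a factor of two or more since
 * the last report. */
int load_changed_significantly(unsigned long reported, unsigned long current)
{
    if (reported == current)
        return 0;
    if (reported == 0 || current == 0)
        return 1;
    return current >= 2 * reported || 2 * current <= reported;
}

/* Step 4: choose a donor for the requester. l0[i] and l1[i] are the last
 * reported loads of processor i in T0 and T1. T1 is consulted only when no
 * other processor holds any T0 load. Returns -1 if nobody can donate. */
int glb_pick_donor(const unsigned long *l0, const unsigned long *l1,
                   int p, int requester)
{
    int donor = -1;
    for (int pass = 0; pass < 2 && donor < 0; pass++) {
        const unsigned long *load = (pass == 0) ? l0 : l1;
        for (int i = 0; i < p; i++) {
            if (i == requester || load[i] == 0)
                continue;
            if (donor < 0 || load[i] > load[donor])
                donor = i;
        }
    }
    return donor;
}
```

The factor-of-two rule keeps the number of status reports logarithmic in the range of a processor's load, at the cost of the GLB seeing loads that may be off by up to that factor.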


An  important  feature  of  the  above  protocol  is  that  an  idle  processor  can  request  a  task  via 
as few as three “hops”: (1) the idle processor issues a task request to the GLB, (2) the GLB
forwards  the  message  to  the  donor  (selected  by  the  GLB),  and  (3)  the  donor  donates  a  task  to 
the  idle  processor. 

Although  the  GLB  is  conceptually  centralized,  it  actually  uses  a  tree  structure,  which  can  be 
distributed  over  processors  as  illustrated  in  Figure  4.4  (note  that  this  need  not  be  a  binary  tree). 
The  main  purpose  is  to  prevent  a  single  GLB  process  from  becoming  a  bottleneck,  especially 
when  the  number  of  processors  is  large  (say  1000).  For  example,  in  the  global  protocol, 
broadcasting  and  collecting  information  would  make  a  single  GLB  become  a  bottleneck  given 
a large number of processors. An extra advantage for such a tree structure is that load balancing
may happen in parallel. For example, as shown in Figure 4.4, if processors $P_0$ and $P_2$ have some
tasks in $T_0$ and processors $P_1$ and $P_3$ simultaneously request tasks, $P_0$ and $P_2$ can respectively
donate tasks in $T_0$ to $P_1$ and $P_3$ at the same time. In this case, each task request requires three
“hops”.

Figure 4.4: An example for the GLB hierarchy.


Definition  4.1  In  a  processor  tree  (as  illustrated  in  Figure  4.4),  a  COMBINING 
OPERATION  is  defined  as  follows.  Starting  from  the  leaf  processor  nodes,  each  node 
other  than  the  root  sends  one  packet  upwards  to  its  parent.  If  the  processor  node 
is  an  internal  processor  node  (other  than  the  root),  it  must  receive  all  the  packets 
from  its  children  before  sending  its  packet  to  its  parent. 

A  DISSEMINATING  OPERATION,  the  reverse  of  the  combining  operation,  is  defined 
as  follows.  Starting  from  the  root,  each  internal  processor  node  sends  a  packet 
downwards  to  each  of  its  children.  The  processor  node  (other  than  the  root)  sends 
a  packet  downwards  after  receiving  another  packet  from  its  parent. 


Definition 4.1 defines two efficient operations, combining and disseminating, in a processor
tree.  The  combining  and  disseminating  operations  are  efficient  because  all  the  processors  at 
the  same  tree  level  can  be  processed  in  parallel  and  therefore  the  latency  is  only  the  time  spent 
in  processor  nodes  along  the  critical  path  among  all  paths  from  leaves  to  the  root.  In  addition, 
the  total  number  of  sends  (or  receives)  is  only  about  the  number  of  processors.  In  the  global 
protocol, collecting processor status can be done via a combining operation, while broadcasting
new values of $\pi^{max}$ can be done via a disseminating operation.
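
A combining operation that computes the system-wide highest priority can be simulated sequentially as in the sketch below. The array-based tree layout (processor 0 is the root and `parent[i] < i` for every other processor), the bound `MAX_PROCS`, and the function name are assumptions of this sketch. In a real system the processors at one tree level would send in parallel; the loop only reproduces the order in which values are merged and counts the packets.

```c
#include <assert.h>

#define MAX_PROCS 64

/* Simulates one combining operation. pi_max[i] is the highest priority on
 * processor i. Each non-root processor sends exactly one packet to its parent
 * after merging the packets of its own children. Returns the value that
 * reaches the root and stores the number of packets in *sends. */
long combine_max(const int *parent, const long *pi_max, int p, int *sends)
{
    assert(p >= 1 && p <= MAX_PROCS && parent[0] == -1);
    long acc[MAX_PROCS];
    for (int i = 0; i < p; i++)
        acc[i] = pi_max[i];
    *sends = 0;
    /* Children have larger indices than their parents, so walking the indices
     * downwards handles every child before its parent sends. */
    for (int i = p - 1; i >= 1; i--) {
        assert(parent[i] >= 0 && parent[i] < i);
        if (acc[i] > acc[parent[i]])
            acc[parent[i]] = acc[i];
        (*sends)++;
    }
    return acc[0];
}
```

For a tree of $p$ processors the simulation counts $p - 1$ packets, in line with the statement that the total number of sends is about the number of processors.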


4.1.2.2 Sparse Priority Distribution

In our model, load balancing becomes most complex when task priorities are sparsely distributed,
i.e., when the total load of tasks with each priority is small. In this situation, if we still
use the above simple global protocol to balance the load, the total load of tasks in $T_0$ (or $T_1$) is
usually small. This implies that the GLB will soon need to update the value of $\pi^{max}$. That is,
the GLB needs to set up new rounds (including broadcasting new $\pi^{max}$ and collecting the load
status) very often, resulting in excessive communication.

The key idea for solving this problem is to group additional highest-priority tasks into $T_0$ or
$T_1$ at the beginning of each round so that the GLB will be able to donate more tasks from these
sets  and  will  not  have  to  rebroadcast  for  the  next  round  so  quickly.  Note  that  the  technique  of 
grouping  tasks  will  result  in  schedules  which  do  not  strictly  obey  the  order  of  priorities;  but  as 
mentioned  in  Chapter  2  the  program  correctness  should  never  rely  on  priorities. 

There  is  an  important  issue  for  grouping  tasks:  we  do  not  want  the  total  load  of  these 
grouped  tasks  to  be  too  small  or  too  large. 

•  If  the  total  load  is  too  small,  the  system  will  soon  need  to  set  up  a  new  round  again.  Thus, 
the  effect  of  grouping  extra  tasks  does  not  help  much  in  this  case. 

• If the total load is too large, the system may schedule many tasks with priorities lower
than $\pi^{max}$; this fails to follow priority well. For example, for BFS, scheduling many
tasks with low priority could end up wasting a significant amount of computation time.

So, the compromise is that for each round we want the total load of grouped tasks to be
comparable to a load threshold $L_{thr} = c L_{ovh}$, where $c$ is a large constant and $L_{ovh}$ is the
aggregate overhead associated with setting up a new round among all processors. With this
compromise, we make the overhead ($L_{ovh}$) less significant while keeping the total load of
extra grouped tasks as small as possible. Since setting up a new round requires a disseminating
operation to broadcast the new value of $\pi^{max}$ and a combining operation to collect all processors'
load status, the overhead $L_{ovh}$ is¹ $O(p \log p)$ sends/receives, which can be roughly predicted a
priori.

To be more specific about the compromise, we will, during each round, group the highest-priority
tasks whose total load is $L_{grp} = \Theta(L_{thr})$, if the total load of tasks in the whole system is
$L_{total} = \Omega(L_{thr})$ and each task's grain size is $O(L_{thr})$. If the total load of tasks is $L_{total} < L_{thr}$,
then $L_{grp} = L_{total}$ because we can at most group tasks with total load $L_{total}$. If the maximum
task grain size $G_{max}$ is larger than $L_{thr}$, then $L_{grp} = O(G_{max})$ because we may need to group
the task with $G_{max}$ (e.g., when the task with the grain size $G_{max}$ has the highest priority).

In Section 5.2, we will design an efficient algorithm, called the parallel range selection
(PRS) algorithm, on a processor tree with a constant degree (> 2); this algorithm can solve the
sparse priority distribution problem and satisfy the following two properties:

01 The algorithm only requires one combining operation and then one disseminating operation.
02 Each packet size is $O(\log^2 p)$.

In the above problem, we allow the total load of grouped tasks, $L_{grp}$, to be in a range
$\Theta(L_{thr})$, not just a fixed value (say $L_{thr}$), for the following two reasons.

• It appears to be hard to implement an efficient algorithm to group the highest-priority
tasks with a fixed total load $L_{thr}$, while satisfying Properties 01 and 02 simultaneously.

• It is not critical for $L_{grp}$ to be exact because the environment is changing. During the
period when we group tasks for $T_0$ or $T_1$, the load status on each processor may have,
more or less, been changed due to new task scheduling and task creation.


4.1.2.3 Advanced Global Protocol


In this section, we will modify the simple global protocol by adding the PRS algorithm (to
be described in Section 5.2) with properties 01 and 02, such that the protocol can also cope
with the problem of sparse priority distributions. In this advanced global protocol, the PRS
algorithm is used to group a set of additional highest-priority tasks at the beginning of each
load balancing round. These tasks are grouped into $T_1$ (i.e., grouped into the $T_0$ of the next
load balancing round in advance) so that we can simultaneously perform the PRS operation
and balance the tasks in the current $T_0$.

¹If $g(n) = O(f(n))$, there exists some positive $c$ for which $g(n) < c f(n)$ for all sufficiently large $n$.
If $g(n) = \Omega(f(n))$, there exists some positive $c$ for which $g(n) > c f(n)$ for all sufficiently large $n$.
If $g(n) = \Theta(f(n))$, then $g(n) = O(f(n))$ and $g(n) = \Omega(f(n))$.

For the new design, the two sets $T_0$ and $T_1$ are changed as follows. We define $T_0 = T_0' + T_0''$, where $T_0'$ is the set of tasks with priority $\pi_{max}$ and $T_0''$ is the set of tasks determined by the PRS algorithm in the previous round. Similarly, we define $T_1 = T_1' + T_1''$, where $T_1'$ is the set of tasks with priority $\pi_{max} - 1$ (excluding those tasks in $T_0$) and $T_1''$ is the set of tasks determined by the PRS algorithm in the current round. In fact, $T_0'$ and $T_1'$ are just $T_0$ and $T_1$ in the original global protocol.

The new protocol is the same as the original protocol except for the following. At the very beginning of a round, each processor moves tasks from $T_1''$ to $T_0''$, reports processor status, and starts the combining operation of the PRS algorithm. Since the status report also requires a combining operation, we can merge the two combining operations to reduce communication overhead. When the root of GLB receives all the reports, it checks whether the total load of tasks in $T_1'$ is already large enough (e.g., greater than $L_{thr}$). If so, we do not need to apply the PRS algorithm to group tasks for the next round. If not, we perform the disseminating operation of the PRS algorithm to group more highest-priority tasks into $T_1''$. This provides us with a chance to omit the disseminating operation when $T_1'$ is already large enough. We will show below that the PDC-WK scheduling algorithm can usually omit the disseminating operation.
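The start-of-round bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the names `ProcessorState`, `begin_round`, and `root_needs_dissemination` are hypothetical, tasks are modeled as (name, grain size) pairs, and the combining operation is reduced to collecting one reported load per processor.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorState:
    # Hypothetical per-processor view of the task sets used by the protocol.
    t0_second: list = field(default_factory=list)  # T0'': grouped by PRS in the previous round
    t1_prime: list = field(default_factory=list)   # T1': tasks with the second-highest priority
    t1_second: list = field(default_factory=list)  # T1'': grouped by PRS in the current round


def begin_round(proc):
    """Promote the tasks grouped in advance and return the local T1' load to report."""
    proc.t0_second.extend(proc.t1_second)
    proc.t1_second.clear()
    return sum(grain for _name, grain in proc.t1_prime)


def root_needs_dissemination(reported_loads, l_thr):
    """The root skips dissemination when the total T1' load already exceeds l_thr."""
    return sum(reported_loads) <= l_thr


procs = [
    ProcessorState(t1_prime=[("a", 4)], t1_second=[("x", 1)]),
    ProcessorState(t1_prime=[("b", 3)]),
]
loads = [begin_round(p) for p in procs]
print(loads, root_needs_dissemination(loads, 10), root_needs_dissemination(loads, 5))
```

With coarse grain sizes, as argued for PDC-WK, the reported total tends to exceed the threshold, so the second function returns `False` and the disseminating step is skipped.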

Before  examining  the  case  of  the  PDC-WK  scheduling  algorithm,  we  need  to  carefully 
define  the  task  grain  size.  We  define  the  grain  size  of  a  task  to  be  the  average  time  between 
the  moment  the  task  begins  executing  and  the  moment  the  corresponding  processor  requests 
the next task from another processor, not just the amount of computation taken for the task.
For  example,  for  the  PDC-WK  scheduling  algorithm  the  grain  size  of  each  task  is  quite  large 
because  we  usually  exhaust  all  the  local  tasks  and  their  descendants  before  requesting  a  task 
from  other  processors.  For  PBFS-GPQ,  the  grain  size  of  a  task  is  about  the  amount  of  time 
for  computing  the  task  because  the  algorithm  is  very  likely  to  schedule  the  next  task  with  the 
highest  priority  from  another  processor.  Although  the  definition  of  task  grain  size  is  not  so 
straightforward, we argue that the scheduling programmer can help the application programmer
to calculate the task grain size.


Now,  let  us  examine  the  case  of  the  PDC-WK  scheduling  algorithm  according  to  the  above 
definition. Since each node's grain size is very large, the total load in $T_1'$ tends to be very large as long as $T_1'$ has at least one node. Consequently, the disseminating operation is usually omitted, and
the  protocol  is  almost  the  same  as  the  original  one  in  this  case.  Our  experiments  presented  in 
Chapter  6  also  confirm  that  the  disseminating  operation  is  usually  omitted  for  PDC-WK. 


4.2  Maintaining  Physical  Lists 


Since  PLs  are  similar  to  priority  queues,  PLs  can  be  maintained  in  the  same  way  as  priority 
queues.  A  priority  queue  data  structure  often  requires  the  following  primitive  operations. 


Insert(T, $\pi$): Inserts a task T with priority $\pi$ into the priority queue.

Delete(T): Deletes a task T from the priority queue. Note that in the multilist scheduling model, if a task is scheduled from one PL, then we also need to delete the instances of that task from the other PLs on the same processor.

Maxpri(): Returns the highest priority in the queue.

Deletemax(): Deletes and returns the task with the highest priority in the priority queue.


As described in Section 2.2, there may be a derived PL based on this PL with a monotonically decreasing priority translation function, so we also need to provide the Minpri and Deletemin primitive operations. This is because the highest priority in the derived PL is the lowest priority in the base PL.

The above operations are called DEQ operations* when they access/insert/delete a task with the highest or lowest priority. (The Maxpri, Deletemax, Minpri, and Deletemin operations are always DEQ operations, while the Insert and Delete operations are sometimes DEQ operations.)

* A "DEQ operation" is our preferred name for what others (e.g., [97]) have called a "deque operation", i.e., an operation on a "double ended queue", which is accessed only at its two ends. We avoid the term "deque operation" because this sounds like removal from a queue.

For the PLs involved in a global scheduling subpattern, we may also need to provide other operations required by the PRS algorithm (described in Section 5.2). They include:


Threshpri($L_{thr}$), where $L_{thr} > 0$: If $L(-\infty) < L_{thr}$, Threshpri returns $-\infty$ and $L(-\infty)$; otherwise, there is a priority $\pi$ such that $L(\pi+1) < L_{thr} \le L(\pi)$, and Threshpri returns $\pi$ and $L(\pi)$, where $L(\pi) = \sum G_T$ for all tasks T with priorities $\pi' \ge \pi$. Note that we can substitute $\pi_{min} - 1$ for $-\infty$, and $\pi_{max} + 1$ for $\infty$, where $\pi_{max}$ ($\pi_{min}$) is the highest (lowest) priority which can be used. The PRS algorithm needs to use this operation to obtain the priority distribution.

Split($\pi$): Splits the priority queue into two, one part containing all the tasks with priority $\pi' \ge \pi$ and the other part containing all the other tasks. The PRS algorithm will need this operation to select all the tasks with priorities greater than or equal to a threshold priority.
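The intended results of these two operations can be pinned down with a brute-force reference model. The sketch below is only a specification aid under stated assumptions: tasks are (priority, grain size) pairs with integer priorities and positive grain sizes, the function names are hypothetical, and the task set is made up to reproduce the totals quoted later for Figure 4.6 (it is not the figure itself). It runs in linear time or worse, unlike the tree-based structure proposed in Section 4.2.1.

```python
import math


def load_at_or_above(tasks, pi):
    """L(pi): total grain size of the tasks whose priority is >= pi."""
    return sum(grain for prio, grain in tasks if prio >= pi)


def threshpri(tasks, l_thr):
    """Return (threshold priority, load) for a load threshold l_thr > 0."""
    assert l_thr > 0
    total = sum(grain for _prio, grain in tasks)
    if total < l_thr:
        return -math.inf, total
    # Scan the distinct priorities from highest to lowest; the first one whose
    # cumulative load reaches l_thr satisfies L(pi + 1) < l_thr <= L(pi).
    for pi in sorted({prio for prio, _grain in tasks}, reverse=True):
        load = load_at_or_above(tasks, pi)
        if load >= l_thr:
            return pi, load
    raise AssertionError("unreachable when grain sizes are positive")


def split(tasks, pi):
    """Return (tasks with priority >= pi, all the other tasks)."""
    high = [t for t in tasks if t[0] >= pi]
    low = [t for t in tasks if t[0] < pi]
    return high, low


# Hypothetical task set: (priority, grain size), total load 39.
tasks = [(9, 8), (7, 5), (6, 1), (5, 6), (3, 10), (1, 9)]
print(threshpri(tasks, 45))
print(threshpri(tasks, 15))
high, low = split(tasks, 5)
print(high, low)
```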

Here, we should note that if a global scheduling subpattern uses a derived PL based on another PL and a priority translation function f, the derived PL needs to translate its priority threshold $\pi$ to the priority in the base PL in order to perform the Split operation. Since the function f only translates the priorities in the base PL to those in the derived PL, the programmer needs to provide the inverse of function f to translate priorities from the derived PL to those in the base PL. If function f is linear (as in the PDC-WK scheduling algorithm and the factoring scheduling algorithm), i.e., $f(\pi) = a\pi + b$, the programmer only needs to specify the constants a and b; the system, knowing that f is linear, can automatically find the inverse of function f.
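For the linear case, the automatic inversion can be illustrated with a short sketch. The class and function names are hypothetical, the constants used in the example are arbitrary rather than those of PDC-WK or factoring, and exact rational arithmetic is an assumption made here to avoid rounding when a does not divide evenly.

```python
from fractions import Fraction


class LinearTranslation:
    """Priority translation f(pi) = a*pi + b from a base PL to a derived PL."""

    def __init__(self, a, b):
        if a == 0:
            raise ValueError("a must be nonzero for f to be invertible")
        self.a = Fraction(a)
        self.b = Fraction(b)

    def forward(self, pi):
        """Translate a base-PL priority into the derived PL."""
        return self.a * pi + self.b

    def inverse(self, derived_pi):
        """Translate a derived-PL priority back into the base PL."""
        return (derived_pi - self.b) / self.a

    @property
    def decreasing(self):
        return self.a < 0


def derived_maxpri(f, base_min, base_max):
    """Highest derived priority: f(base_min) if f is decreasing, else f(base_max)."""
    return f.forward(base_min if f.decreasing else base_max)


def base_split_bound(f, derived_threshold):
    """Base-PL condition equivalent to 'derived priority >= derived_threshold'."""
    bound = f.inverse(derived_threshold)
    # For a decreasing f the inequality flips: f(x) >= t  <=>  x <= f^-1(t).
    return ("<=" if f.decreasing else ">=", bound)


f = LinearTranslation(-1, 10)  # arbitrary decreasing example
print(f.forward(3), f.inverse(7), derived_maxpri(f, 2, 9), base_split_bound(f, 7))
```

The flipped inequality for a decreasing f is the reason the base PL must support Minpri and Deletemin as well as Maxpri and Deletemax.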

Now,  we  want  to  see  how  to  support  the  above  operations  for  derived  PLs  and  base  PLs. 
For  a  derived  PL,  we  can  perform  all  the  PL  operations  based  on  the  base  PL  and  the  priority 
translation  function  (maybe  including  the  inverse  of  the  function).  For  a  base  PL,  we  will, 
in  Section  4.2.1,  propose  an  efficient  priority  queue  supporting  the  above  operations.  This 
proposed  priority  queue  will  satisfy  the  following  two  properties. 

P1 The worst-case times of all the operations are O(log n), where n is the number of priorities in the priority queue. (Note that the worst-case time of the operation Maxpri or Minpri is only O(1).)

P2 The amortized times of all the DEQ operations are O(1). The amortized time [92] is defined as the average time of an operation in a worst-case sequence of operations.


The first property P1 implies that these operations are efficient in the worst case. The worst-case time is important because a small worst-case bound can shorten the response time for interprocessor communication operations. For example, assume that the worst-case time is O(n). Even if the amortized time is much less than O(n), the donation of a particular task may be significantly delayed by some operation requiring O(n) computation time. Therefore, the performance may become poor and unpredictable.

The  second  property  P2  implies  that  the  DEQ  operations  are  optimal.  This  property  is 
also  important  because  the  DEQ  operations  happen  in  many  cases.  As  mentioned  above,  the 
operations,  Deletemax,  Maxpri,  Deletemin  and  Minpri,  are  always  DEQ  operations.  In 
addition,  some  applications  tend  to  insert/delete  a  task  only  at  the  two  ends.  For  example,  in 
PDC-WK,  when  we  schedule  the  deepest  node  locally,  the  node  has  the  highest  local  priority, 
but  has  the  lowest  global  priority;  when  we  schedule  the  node  closest  to  the  root  from  all  the 
(other)  processors,  the  node  has  the  highest  global  priority,  but  has  the  lowest  local  priority. 


4.2.1  Data  Structure  of  Priority  Queue 

Our priority queues are based on 2-3 trees [3], a kind of balanced search tree, which have the following two basic properties.

•  Each internal node has 2 or 3 children, except that the root may have fewer than two children when there are fewer than two nodes in the whole priority queue.

•  All  leaves  are  at  the  same  depth. 


From the two properties above, for a 2-3 tree with height $h \ge 1$, the minimum number of nodes is $2^h$ and the maximum number of nodes is $3^h$. Thus, h = O(log n), where n is the number of nodes in the tree.

In 2-3 trees, each node is associated with a priority key; each leaf node is also associated with those tasks whose priorities are the same as the leaf's priority. These priorities are ordered as follows. For each internal node $\nu$ whose priority is denoted by $\pi$ and whose ith child and ith child's priority are denoted by $\nu_i$ and $\pi_i$, there are two ordering restrictions for these priorities: (1) $\pi \le \pi_i$ for all i; (2) if i < j, all the priorities of nodes in the subtree rooted at $\nu_i$ must be larger than those at $\nu_j$ (note that we say that $\nu_i$ is to the left of $\nu_j$). An example is illustrated in Figure 4.5. Because of the second restriction, for a given node and a given priority, we can easily find the child of the node whose subtree may have the leaf with the priority. We can use this technique to search for a leaf whose priority is adjacent to a given priority. Since the priorities of leaves are all distinct, we store all tasks with the same priority in the same leaf.

In [3], Aho et al. proved that the times of the operations Insert, Delete, Deletemax, Maxpri, Deletemin, Minpri, and Split are only O(log n) because each of these operations traverses a path between the root and some leaf at most once downwards and once upwards. First, we search from the root to the leaf at which the access, insertion, deletion, or splitting takes place. Note that the Delete operation does not need to search from the root because we can let every task hold a direct pointer to the leaf with the same priority as the task's. After finding the leaf $\nu$, we perform the desired operation and rebalance the tree if necessary to ensure that it is still a 2-3 tree. This requires only one upward pass.

In order to support the Threshpri operation, we need to let each node $\nu$ in the tree have a load variable $L_\nu$. (Note that this is similar to the augmented tree described in [28, Chapter 15].) If $\nu$ is a leaf, $L_\nu = \sum G_T$ for all tasks T in $\nu$; otherwise $\nu$ is an internal node and $L_\nu = \sum L_{\nu'}$ for all children $\nu'$ of $\nu$. An example is illustrated in Figure 4.6, in which the number on the right-hand side of each rectangle (i.e., task) represents the grain size of the task, and the number on the right-hand side of each circle (i.e., tree node) represents the load variable value of the node. So, whenever a leaf changes its load (e.g., a task is added to some leaf), we can just use the above formula to update the load variables from the leaf to the root. In addition, whenever reconfiguring the tree upwards, we may also need to recalculate the load variables of the touched nodes by the above formula. Thus, the other operations still need O(log n) time.

Figure 4.6: A 2-3 tree, showing load variable values.

For the Threshpri operation with a given load threshold $L_{thr}$, if the total load is smaller than $L_{thr}$, we stop and return the total load and the threshold priority $-\infty$; otherwise, we search along the path from the root to the leaf with the threshold priority satisfying the operation. For example, in Figure 4.6, if $L_{thr} = 45$, we return the total load 39 and the threshold priority $-\infty$; if $L_{thr} = 15$, we search from the root to the leaf with priority 5 and return the load 20 (= 8 + 5 + 1 + 6) and the priority 5. In the latter case, starting from the root, we repeatedly visit one node downwards until a leaf is reached, such that for each visited node $\nu$ the following condition holds: $L_l < L_{thr} \le L_l + L_\nu$, where $L_\nu$ is the load variable of $\nu$ and $L_l$ is the total load of tasks in all the subtrees to the left of $\nu$ (i.e., the total load of tasks with priorities $> \pi'$, where $\pi'$ is the largest priority of the leaves in the subtree rooted at $\nu$). The condition holds trivially for the root. For example, in Figure 4.6, for the root, since $L_l = 0$ and $L_\nu = 39$, the condition holds (with $L_{thr} = 15$). Then, when visiting a node $\nu$ on the path that satisfies the condition, the definition of load variables guarantees that one of its children must satisfy the condition too. For example, for the root in Figure 4.6, we can find the internal node $\nu$ (with priority 5) satisfying the condition (with $L_{thr} = 15$), since $L_l = 13$ and $L_\nu = 7$. The reader can verify that the condition holds for the leaf with priority 5, too. When the leaf $\nu$ with priority $\pi$ is reached, we can derive the result $L(\pi+1) < L_{thr} \le L(\pi)$ since $L_l < L_{thr} \le L_l + L_\nu$, $L_l + L_\nu = L(\pi)$, and $L_l = L(\pi+1)$. So, $\pi$ is just the threshold priority for the function. From the above, the Threshpri operation only needs to visit O(log n) nodes.
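The descent can be sketched on a small static tree whose nodes carry load variables. This is an illustrative sketch, not the thesis code: the `Node` class is hypothetical, rebalancing and updates are omitted, an internal node's priority is assumed to be the smallest priority in its subtree, and the example tree is made up so that its loads match the numbers quoted in the text (total 39; left load 13 and load variable 7 at the internal node with priority 5).

```python
import math


class Node:
    """Tree node with a load variable; children are ordered highest priority first."""

    def __init__(self, priority=None, tasks=None, children=None):
        self.children = children or []
        self.tasks = tasks or []  # grain sizes; only meaningful for leaves
        if self.children:
            self.load = sum(c.load for c in self.children)
            self.priority = min(c.priority for c in self.children)
        else:
            self.load = sum(self.tasks)
            self.priority = priority


def threshpri(root, l_thr):
    """Return (threshold priority, load) by one root-to-leaf descent."""
    if root.load < l_thr:
        return -math.inf, root.load
    node, left_load = root, 0
    # Invariant: left_load < l_thr <= left_load + node.load
    while node.children:
        for child in node.children:
            if l_thr <= left_load + child.load:
                node = child
                break
            left_load += child.load
        else:
            raise ValueError("load variables are inconsistent")
    return node.priority, left_load + node.load


# Hypothetical example tree (not Figure 4.6 itself).
a = Node(children=[Node(9, [8]), Node(7, [5])])   # load 13
b = Node(children=[Node(6, [1]), Node(5, [6])])   # load 7, priority 5
c = Node(children=[Node(3, [10]), Node(1, [9])])  # load 19
root = Node(children=[a, b, c])                   # load 39
print(threshpri(root, 45), threshpri(root, 15))
```

Each level of the tree is visited once and at most three children are inspected per level, which is where the O(log n) bound comes from.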


Amortized  Time 

We will further modify the data structure such that the amortized times for DEQ operations satisfy Property P2. For insertion/deletion, many researchers [47, 68] have proved that the amortized time for rebalancing the tree is only O(1). However, DEQ operations may still need O(log n) time for the following two basic procedures: (1) to search downwards to the leaf at which the access, insertion, or deletion takes place and (2) to update the load variables upwards. So, in order to make the amortized times for the DEQ operations O(1), we will modify the data structure to reduce the computation times of the above two procedures for DEQ operations to O(1), as follows.

First, we put a finger at each of the two ends; i.e., the priority queue has two special fingers pointing to the leaves with the highest priority and the lowest priority. (Note that putting fingers on a search tree [38, 98] is a common technique for finding special nodes and their neighbors more quickly.) Hence, for DEQ operations, we can directly find the leaf from the two fingers without searching downwards from the root to the leaf. Thus, for each DEQ operation, the time for finding the leaf is O(1). In addition, with the same technique, the total worst-case times of the Maxpri and Minpri operations are only O(1).

Second, we allow nodes on the leftmost and rightmost paths to have inaccurate load status, where the leftmost (rightmost) path is the path between the root and the leaf with the lowest (highest) priority. So, we do not need to update the load variables of the nodes on both paths, as shown in Figure 4.7. Thus, for all the DEQ operations, we do not have to update the load upwards, i.e., the computation time for updating the load is O(1). With this modification, each Threshpri operation needs to update the load status of nodes on both paths (upwards) first and then perform the original Threshpri operation. Thus, the new Threshpri operation still needs O(log n) time.

4.3 Discussion


Although we have proposed some efficient techniques to implement the multilist model, it is still
very  difficult  to  argue  that  our  model  achieves  the  ultimate  goal:  for  all  scheduling  algorithms, 
our  general  approach  incurs  no  significant  overhead.  So,  we  will  leave  this  problem  open 
and  only  argue  that  our  general  approach  incurs  no  significant  performance  overhead  at  least 
for  the  following  four  important  scheduling  algorithms:  PBFS-GPQ,  PBFS-R,  PDC-WK,  and 
PDC-RR. 


1. For the PBFS-GPQ scheduling algorithm, researchers in [76] also used a centralized
mechanism  to  maintain  the  global  priority  queue.  They  did  not  consider  the  sparse 
priority  distribution  situation  because  their  task  grain  size  was  coarse  enough  and  they 
had  relatively  few  processors.  But,  if  their  task  grain  size  were  small  or  if  they  had  many 
processors,  we  expect  that  our  advanced  global  protocol  could  be  used  to  solve  their 
problem.  So,  if  a  dedicated  design  for  PBFS-GPQ  followed  our  protocol,  our  general 
system  should  perform  almost  as  efficiently  as  the  dedicated  one. 

2. For the PBFS-R scheduling algorithm, Karp and Zhang [57] used a randomized technique. The corresponding scheduling scheme based on our model (as shown in Section 3.1.1.2) merges PLs into local VLs. Because of this locality, our model will perform almost as efficiently as a dedicated design for PBFS-R.

3. For the PDC-WK scheduling algorithm, Wu and Kung [105] used a global pool to contain all the nodes at the highest available tree level in the D&C computation tree. Whenever some processor is idle, nodes in the pool can be donated to the idle processor. In fact, this algorithm is the same as the simple global protocol except for using $T_1$ (see Section 4.1.2.1). Note that the set $T_0$ is equivalent to the global pool. In our system, the set $T_1$ is used to make the transition from the current maximum priority to the next maximum priority smoother. In fact, a dedicated design for parallel D&C may also benefit from using this technique of adding the set $T_1$.

In the advanced protocol, we may require an extra combining and disseminating operation when there is a sparse priority distribution. However, for PDC-WK, we can assign a very large grain size to each node, such that the total load of tasks in $T_1'$ is very large in most cases. Since $T_1'$ already has enough load, we do not need to use the extra disseminating operation to group extra tasks into $T_1''$. Thus, the amount of communication is almost the same as in the simple global protocol or a dedicated PDC-WK algorithm.

In  addition,  for  PLs,  the  PDC-WK  algorithm  always  inserts  and  deletes  a  task  with  the 
highest  priority  or  the  lowest  priority,  as  mentioned  earlier.  Thus,  in  our  system,  the 
amortized times for these DEQ operations are only O(1). In a dedicated PDC-WK design, we can use a doubly-linked list to maintain these tasks such that the time for each DEQ operation is only O(1). Our system may still not be as fast as the dedicated design which uses a doubly-linked list, but it is within a constant factor. In the
future,  we  may  even  provide  several  types  of  data  structures  (including  doubly-linked 
list)  and  allow  programmers  to  choose  their  preferred  data  structures.  From  the  above, 
we  can  conclude  that  our  general  system  incurs  no  significant  performance  overhead  for 
PDC-WK. 

4. For the PDC-RR scheduling algorithm in Section 3.1.2.2, researchers in [34, 82, 44] all used the following strategy: an idle processor requests a task from another processor in
a  round-robin  fashion.  Section  4.1.1  showed  that  our  system  can  judge  from  the  priority 
range  information  that  a  pre-selected  processor  does  not  need  to  report  the  status  of  its  GL 
to  its  destination  processors,  unless  its  destination  processors  become  idle  and  explicitly 
send  task  requests.  Thus,  the  algorithm  based  on  our  model  will  perform  almost  as 
efficiently  as  a  dedicated  PDC-RR  design. 

The above shows that our system, based on a uniform scheduling model, can efficiently implement the above scheduling algorithms. In the past, it has been difficult for any single scheduling system to efficiently support the scheduling algorithms for both parallel BFS and D&C. For example, Manber and Finkel in [34] provided a parallel programming system for both problems; however, they also pointed out that it is difficult for them to use a uniform framework to efficiently support these algorithms. We believe that our system is the first system that can do so.


Chapter 5


Selected  Theoretical  Topics 


In order to develop our multilist scheduling system, we have also studied two theoretical topics. This chapter will present their results. First, Section 5.1 will present the communication complexity for parallel divide-and-conquer (D&C). The theoretical results in Section 5.1 were previously published in [105]. Second, Section 5.2 will propose an efficient algorithm for the parallel range selection problem that will be used in the implementation of our model (see Section 4.1.2.2). Since the two topics in this chapter have no strong relation to each other or to the rest of this thesis, the reader can skip this chapter without loss of continuity.


5.1  Communication  Complexity  for  Parallel  D&C 


In this section, we will theoretically study the relationship between load balancing and communication cost for performing D&C computations on a parallel system. As described in Section
3.1.2,  D&C  is  a  common  computation  paradigm,  in  which  the  solution  to  a  problem  is  obtained 
by  solving  subproblems  recursively.  Each  node  in  the  tree  corresponds  to  a  problem  instance, 
and  children  of  the  node  correspond  to  its  subproblems.  During  the  computation,  each  internal 
(non-leaf)  node  goes  through  two  phases.  The  first  phase  is  the  divide  phase  during  which  the 
problem  instance  associated  with  the  node  is  divided  into  subproblems.  The  second  phase  is  the 
combine  phase  during  which  the  solution  of  the  problem  instance  associated  with  the  node  is 
derived by combining solutions of the subproblems associated with the node's children. After its creation, each leaf will perform some computation and return the results to its parent. At a
given  time,  nodes  on  a  wavefront  that  cuts  across  all  paths  from  the  root  to  leaves  can  be  active 
in  performing  divide,  combine,  or  compute  operations.  Along  each  path  the  wavefront  first 
moves  down  from  the  root  to  its  leaf  and  then  up  from  the  leaf  to  the  root. 

At  first  glance,  one  might  think  that  it  should  be  straightforward  to  perform  D&C  in 
parallel,  because  nodes  on  the  wavefront  can  all  be  processed  independently.  However,  if  one 
wants  to  achieve  good  load  balancing  between  the  processors,  then  parallelizing  D&C  becomes 
nontrivial.  In  fact,  doing  efficient  D&C  on  any  real  parallel  machine  has  been  a  major  challenge 
to  researchers  [33,  34,  82,  89]  for  many  years. 

The  difficulties  are  due  to  the  fact  that  many  D&C  computations  are  highly  dynamic  in  the 
sense  that  these  computations  are  data-dependent.  During  computation,  a  problem  instance 
can  be  expanded  into  any  number  of  subproblems  depending  on  the  data  that  have  been 
computed  so  far.  In  fact,  the  trees  of  many  D&C  computations  can  be  expected  to  be  sparse 
and  irregular,  and  as  a  result,  load  balancing  must  be  adaptive  to  the  tree  structure  and  must  be 
done  dynamically  at  run  time.  This  implies  that  computation  loads  need  to  be  moved  around 
between  processors  during  computation.  The  challenge  is  then  to  devise  efficient  scheduling 
algorithms  which  can  achieve  good  load  balancing  while  minimizing  the  communication  cost 
for  moving  computations  around. 

In general there is a tradeoff between balancing computation loads and minimizing communication costs. The results of this section quantify this tradeoff. In particular, this section
establishes  lower  bounds  on  the  communication  cost  for  any  scheduling  algorithm  based  on 
how  well  it  performs  load  balancing. 


5.1.1  Summary  of  Results 

5.1.1.1  Definitions  and  Notation 

The tree of a D&C computation is called an (N, h, d)-tree, if


•  N is the number of nodes in the tree,

•  h is the height of the tree, and

•  d  is  the  maximal  number  of  children  of  a  node.  (We  assume  that  d  is  at  least  2,  to  allow 
parallel  processing  of  the  tree.) 


A  node  is  said  to  be  at  tree  level  t  if  it  is  the  t-th  node  on  the  path  from  the  root  to  the  node. 
Therefore,  the  root  is  at  level  1,  and  the  height  of  the  tree  is  the  maximal  level  number. 

For  the  parallel  system  which  will  carry  out  the  D&C  computation,  we  assume  that 


•  p  is  the  number  of  processors  in  the  system,  and 

•  it  takes  one  time  step  for  a  processor  to  expand  a  node,  i.e.,  to  perform  the  divide  operation 
for  an  internal  node,  or  to  perform  the  compute  operation  for  a  leaf  node.  For  simplicity, 
we  assume  that  a  processor  takes  no  time  to  perform  a  combine  operation. 


When  a  node  is  expanded,  zero  or  more  children  may  be  generated.  More  precisely,  if  a 
node  does  not  generate  any  children,  the  node  is  a  leaf;  if  a  node  generates  one  or  more  (up  to 
d)  children,  the  node  is  an  internal  node.  Each  newly  generated  node  will  in  turn  be  expanded 
by  some  processor  in  the  future.  A  frontier  node  is  a  node  which  has  been  generated  but  has 
not  been  expanded. 

A  scheduling  algorithm  for  a  D&C  computation  schedules  nodes  (i.e.,  frontier  nodes)  on 
processors  for  expansion.  We  assume  that  scheduling  algorithms  cannot  “lookahead”.  This 
non-lookahead  assumption  is  reasonable  when  dealing  with  irregular  D&C  trees.  In  this  type 
of  tree,  the  number  of  children  a  parent  may  have  (if  any)  is  typically  data-dependent  and  is 
therefore  not  known  a  priori. 

The parallel computation cost $T_A(H)$ of a scheduling algorithm A for a D&C computation tree H is the maximum number of nodes that any processor may expand. Since there are N nodes and p processors, a lower bound on $T_A(H)$ is $T_m = \lceil N/p \rceil$. The parallel computation cost $T_A$ of algorithm A is defined as the maximum $T_A(H)$ for all (N, h, d)-trees H.

The communication cost $C_A(H)$ of a scheduling algorithm A for a D&C computation tree H is the total number of cross nodes. A cross node is a node which is generated by one processor but expanded by another processor. Note that the processor expanding a cross node needs to receive information from the processor generating the node. Therefore, $C_A(H)$ is a reasonable measure for capturing the interprocessor communication cost in performing the divide phase of all the internal nodes. (A similar definition of communication cost is used by Papadimitriou and Ullman in [74].) The communication cost $C_A$ of algorithm A is defined as the maximum $C_A(H)$ for all (N, h, d)-trees H.
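Both cost measures are easy to compute for a concrete schedule. The sketch below assumes a hypothetical representation in which `parent` maps each node to its parent (the root maps to `None`) and `expanded_by` maps each node to the processor that expands it; a node is taken to be generated by the processor that expanded its parent, and the root is assumed not to count as a cross node.

```python
import math
from collections import Counter


def schedule_costs(parent, expanded_by):
    """Return (parallel computation cost, communication cost) of one schedule."""
    per_processor = Counter(expanded_by.values())
    computation = max(per_processor.values())
    # A cross node is expanded by a processor other than the one that generated it.
    cross = sum(
        1
        for node, par in parent.items()
        if par is not None and expanded_by[node] != expanded_by[par]
    )
    return computation, cross


# Small made-up tree with N = 5 nodes scheduled on p = 2 processors.
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
expanded_by = {"r": 0, "a": 0, "b": 1, "c": 0, "d": 1}
costs = schedule_costs(parent, expanded_by)
print(costs, math.ceil(len(parent) / 2))
```

In this example the computation cost meets the lower bound of 3, at the price of two cross nodes ("b" and "d").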


5.1.1.2 Main Results


Theorem 5.1 For each scheduling algorithm A for a parallel system of p processors, for each integer p', 0 < p' < p, and for each N, h, and d with the following two restrictions,

S1. $N \ge 3pd^2h$, and

S2. $h \ge \lceil \log_d N \rceil + \lceil \log_d pdh \rceil + 1$,

there exists some (N, h, d)-tree H for which at least one of the following two properties is true:

Q1. the parallel computation cost of the algorithm is $T_A(H) \ge N'/p'$;

Q2. the communication cost of the algorithm is $C_A(H) \ge C$,

where $N' = N - 3pd^2h$, $C = p'k$, $k = (d - 1)h'$, and $h' = h - \lceil \log_d N \rceil - \lceil \log_d pdh \rceil - 1$.


Many D&C computations are expected to satisfy restrictions S1 and S2. Since N is usually an exponential function of h, restriction S1 is easily satisfied in these cases. Restriction S2 roughly requires that $N \le d^{h-2}/ph$. If a tree is perfectly balanced and each node has exactly d children, then N would be $\Omega(d^{h-1})$ instead. A perfectly balanced tree is easy for load balancing because the subtrees of each node have the same computation load. Restrictions S1 and S2 basically capture those interesting D&C computations with irregular trees. This class of D&C computations is exactly the one for which it is difficult to achieve good load balancing without paying much in communication overhead. The lower bound on $C_A(H)$, stated in Q2 of the theorem, provides an explanation of why this must be the case.
The two properties Q1 and Q2 in Theorem 5.1 can be expressed in terms of the quantities N,
h, d (associated with the D&C tree) and p (associated with the parallel system) as follows. One
can check that $N' \ge (1-\epsilon_N)N$ and $h' \ge (1-\epsilon_h)h$ for each positive $\epsilon_N < 1$ and $\epsilon_h < 1$,
provided that h is sufficiently large and N lies in a suitable range. (Note: in that range,
$\epsilon_h h \ge \log_d N + \log_d pdh + 3 \ge \lceil \log_d N \rceil + \lceil \log_d pdh \rceil + 1 = h - h'$,
i.e., $h' \ge (1-\epsilon_h)h$; if $N \ge 3pd^2h/\epsilon_N$, then $N' \ge (1-\epsilon_N)N$.) From this and the fact
that $N' < N$ and $h' < h$, we note that N' and h' approach N and h respectively, when both
$\epsilon_N$ and $\epsilon_h$ approach 0. Therefore, Q1 and Q2 in Theorem 5.1 become $T_A(H) = \Omega(N/p)$ and
$C_A(H) = \Omega(pdh)$ for large h, when p' is close to p. Furthermore, we can restate the
theorem in a slightly different form as Corollary 5.2.


Corollary 5.2 For each scheduling algorithm for a parallel system of p processors,
for each positive $\epsilon_C < 1$, which can be arbitrarily close to 0, there are values of
N, h, d, p, and $\epsilon_T$ ($> 0$), for which if the parallel computation cost is between $N/p$
and $(1 + \epsilon_T)N/p$, then the communication cost must be at least $(1 - \epsilon_C)C_u$, where
$C_u = pdh$.


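Proof. Let $p \ge 3/\epsilon_C$ and $d \ge 3/\epsilon_C$, and let N and h be in the range shown above with
$\epsilon_h = \epsilon_C/3$. Choose $\epsilon_T > 0$ and $\epsilon_N$ small enough that
$(1 + \epsilon_T)N/p \le (1 - \epsilon_N)N/p'$ when $p' = p - 1$. One can check that
$C = p'(d-1)h' \ge (1 - 1/p)(1 - 1/d)(1 - \epsilon_h)C_u \ge (1 - \epsilon_C/3)^3 C_u \ge (1 - \epsilon_C)C_u$
when $p' = p - 1$. Thus, if $N/p \le T_A < (1 + \epsilon_T)N/p$ ($\le N'/p'$), the communication cost
must be at least $(1 - \epsilon_C)C_u$. □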


Theorem 5.1 also implies an important tradeoff result: if a scheduling algorithm is to
achieve good load balancing in parallel processing, then it must pay a high price in
communication cost. We can express the tradeoff between $T_A$ and $C_A$ explicitly by showing a
lower bound on the product $T_A \cdot (C_A + k)$. If $(p'-1)k \le C_A < p'k$, where $0 < p' \le p$, then
by Theorem 5.1, $T_A$ must be at least $N'/p'$. Therefore, $T_A \cdot (C_A + k) \ge (N'/p') \cdot p'k = N'k$.
Note that because $T_A \ge N/p \ge N'/p$, this tradeoff is also satisfied when $C_A \ge pk$. This
tradeoff result is summarized in Corollary 5.3 below.
Corollary 5.3 For any scheduling algorithm A for a parallel system of p processors,
for all N, h, and d with restrictions S1 and S2 as defined in Theorem 5.1,

$T_A \cdot (C_A + k) \ge N' \cdot k$,

where N' and k are defined in Theorem 5.1.


Theorem 5.4 A scheduling algorithm A can be devised to have the property that
the parallel computation cost is $T_A = T_{min}$ and the communication cost is
$C_A \le C_u$ ($= pdh$) for any (N, h, d)-tree.


The algorithm satisfying Theorem 5.4 has the minimum parallel computation cost. By
Corollary 5.2, the algorithm is optimal with respect to the communication cost, since the
parallel computation cost of the algorithm is near optimal. These results also imply that the
lower bound on $T_A \cdot (C_A + k)$ in Corollary 5.3 is tight when both $\epsilon_N$ and $\epsilon_h$ are arbitrarily close
to 0.

Note that Theorems 5.1 and 5.4 are so formulated that their results are system-independent.
That is, the results are independent of the interconnection topology of the processors and
various control overheads such as data structure maintenance and reading/writing messages.
Therefore, our upper and lower bounds on $C_A$ are intrinsic to any parallel system. These
bounds give insights into actual communication cost in a real implementation, but exactly how
they are related to the actual cost is a separate matter depending on the implementation. We
have investigated this actual cost by implementing the algorithm on a variety of interconnection
networks in [104].

Section 5.1.2 describes the algorithm of Theorem 5.4. Section 5.1.3 presents a simplified
version of Theorem 5.1 and its proof to help the reading of Theorem 5.1. A complete proof of
Theorem 5.1 is given in Section 5.1.4.
5.1.1.3 Relation to Past Work


There have been several approaches to performing parallel D&C. A simple approach (e.g., in
[7]) is to expand all the nodes above a fixed level on one processor and then distribute nodes
at this level to other processors. Load balancing would be done poorly in this approach when
the tree is irregular. Another approach [89] is to distribute generated nodes, and to have each
processor perform load balancing based on load status information from its neighbor processors.
For this scheme, the communication cost can be very high in the worst case.

Recently, some researchers have made efforts to reduce communication overhead. A
popular  approach  [34,  82,  108]  is  based  on  the  “donate-highest-subtree”  strategy,  in  which  an 
idle  processor  will  be  given  frontier  nodes  as  near  to  the  root  as  possible.  Since  a  subtree 
rooted  near  the  top  usually  has  many  nodes  and  these  nodes  can  all  be  expanded  locally,  this 
strategy  tends  to  reduce  the  amount  of  interprocessor  communication.  Ferguson  and  Korf  [33] 
presented  a  D&C  scheme  with  several  processors  scheduled  first  to  a  node  and  then  to  their 
children.  The  idea  behind  their  scheme  is  also  that  of  distributing  frontier  nodes  near  the  root 
to  idle  processors. 

Although the methods described in the previous paragraph all attempt to reduce communication
overhead, they do not use global information to balance the load. It turns out that the
communication  cost  for  these  methods  can  still  be  high  in  the  worst  case.  For  example,  we 
estimate  that  the  communication  cost  is  for  Ferguson  and  Korf’s  scheme,  and  is 

0{min{jrh.pdhr))  for  the  scheme  in  [34]  with  round-robin  scheduling. 

In contrast, the communication cost for the scheduling algorithm here (Section 5.1.2) is as
low as $O(pdh)$ (Theorem 5.4). This is partly due to the fact that our algorithm is able to make
effective use of global information (i.e., the “global pool” in Section 5.1.2).

Most  importantly,  we  note  that  none  of  the  previous  work  has  any  lower  bound  results 
on  the  communication  cost  for  parallel  D&C  computations.  It  appears  that  our  lower  bounds 
in  Theorem  5.1  and  Corollaries  5.2  and  5.3  are  the  first  lower  bound  results  for  those  D&C 
computations  whose  tree  structures  are  dynamic  in  the  sense  that  the  tree  structure  is  determined 
only  at  run  time.  Previous  results  on  computation  and  communication  cost  tradeoffs  such  as 
those  in  [51,  52,  74]  deal  with  only  static  computation  graphs,  whose  topologies  are  known 
before  the  computation  starts. 
5.1.2 A Scheduling Algorithm and Upper Bounds


This  section  describes  a  new  scheduling  algorithm  which  can  achieve  the  upper  bounds  in 
Theorem  5.4  for  both  parallel  computation  cost  and  communication  cost.  The  bounds  hold  for 
any D&C computation, i.e., for any (N, h, d)-tree no matter how irregular it is.


Proposed  Scheduling  Algorithm 

The  scheduling  algorithm  uses  a  data  structure,  called  a  Global  Pool  (abbr.  GP),  to  keep 
track  of  frontier  nodes  at  a  particular  tree  level  which  have  not  been  taken  by  any  processor  for 
expansion.  This  level,  identified  by  a  variable  gl,  has  the  property  that  nodes  at  higher  levels 
have  all  been  taken  by  processors.  Every  processor  will  try  to  take  a  node  from  the  GP  to  work 
on  whenever  it  becomes  idle.  For  the  proof  of  Theorem  5.4,  it  suffices  to  assume  that  the  GP 
is  maintained  by  some  single  processor.  (See  [104]  for  a  distributed  scheme  where  the  GP  is 
maintained  by  multiple  processors.) 

Initially, the GP contains only the root and the value of gl is one. The GP becomes empty
when all of its nodes at level gl have been taken by the processors. At this moment, all the
processors are requested to send in their frontier nodes at level gl + 1 in the next time step
when all the nodes at level gl + 1 have been generated. Then the GP is filled with this set of
new nodes, and gl is increased by one. This process is repeated until all the nodes have been
expanded.

The  key  idea  of  this  algorithm  is  what  each  processor  will  do  after  it  has  taken  a  node 
from  the  GP.  The  processor  will  do  a  depth-first  traversal.  Consequently,  the  processor  can 
exhaust  all  possible  work  locally  before  asking  for  a  new  node  from  the  GP.  As  a  result,  we  can 
prove (below) that the communication cost can be as low as $C_u$. While not related to parallel
computation  cost  and  communication  cost,  an  important  advantage  of  this  local  depth-first 
strategy  is  that  it  uses  the  minimum  amount  of  memory. 

In  essence  the  scheduling  algorithm  described  here  uses  a  breadth-first  scheme  to  distribute 
big  chunks  of  computations  to  processors,  and  has  each  processor  after  receiving  a  computation 
follow  the  depth-first  strategy  locally.  Therefore,  the  algorithm  is  a  hybrid  method,  which 
interestingly  will  do  a  purely  depth-first  traversal  of  the  tree  in  the  case  that  only  one  processor 
is used.
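
The hybrid scheme described above can be illustrated with a small synchronous simulation. This is only a sketch under stated assumptions, not the thesis implementation: the GP is a single central list, every processor expands at most one node per time step, the GP is refilled at the end of any step in which it is empty, and the communication measure is the number of nodes sent into the GP. All names (`random_tree`, `simulate_global_pool`) are hypothetical.

```python
import random


def random_tree(n_nodes, d, h, seed=0):
    """Build an irregular tree with degree <= d and height <= h (root at level 1)."""
    rng = random.Random(seed)
    children, level = {0: []}, {0: 1}
    open_nodes = [0]                      # nodes that may still receive children
    while len(children) < n_nodes and open_nodes:
        parent = rng.choice(open_nodes)
        child = len(children)
        children[parent].append(child)
        children[child] = []
        level[child] = level[parent] + 1
        if len(children[parent]) == d:
            open_nodes.remove(parent)
        if level[child] < h:
            open_nodes.append(child)
    return children, level


def simulate_global_pool(children, level, p):
    """Return (time steps, nodes expanded per processor, nodes sent to the GP)."""
    gp, gl = [0], 1                       # Global Pool and its current level
    stacks = [[] for _ in range(p)]       # local depth-first stacks
    expanded = [0] * p
    sent_to_gp = 0
    steps = 0
    while gp or any(stacks):
        steps += 1
        for s in stacks:                  # scheduling phase: idle processors take a GP node
            if not s and gp:
                s.append(gp.pop())
        for i, s in enumerate(stacks):    # expansion phase: one node per busy processor
            if s:
                node = s.pop()            # deepest local node first (depth-first)
                expanded[i] += 1
                s.extend(children[node])
        if not gp:                        # GP exhausted: collect frontier nodes at level gl + 1
            gl += 1
            for s in stacks:
                moved = [v for v in s if level[v] == gl]
                s[:] = [v for v in s if level[v] != gl]
                gp.extend(moved)
                sent_to_gp += len(moved)
    return steps, expanded, sent_to_gp


if __name__ == "__main__":
    tree, lev = random_tree(500, 3, 12, seed=1)
    print(simulate_global_pool(tree, lev, p=4))
```

In this model each local stack holds at most d nodes per level, so each refill moves at most pd nodes into the GP, which is the counting argument used for the $C_u = pdh$ bound.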


Suppose that we define the parallel computation time to be the time (in terms of number of
time steps) when the last node is expanded by a processor. Then the parallel computation time
of the algorithm described here is at most $\lceil N/p + h \rceil$. To see this, we note that some processors
may become idle only when the number of nodes in the GP is smaller than the number of idle
processors. In the worst case all the p processors may become idle at the end of some time
step, but at this time there is only one node in the GP. Thus, in the next time step, as many as
$p - 1$ processors may be idle. This situation can happen at most h times. Therefore, in the
entire D&C computation, an additional $h(p-1)$ nodes could have been expanded if there were
no idle processors at any time step. This implies that the parallel computation time is at most
$\lceil (N + h(p-1))/p \rceil \le \lceil N/p + h \rceil$.

Note that parallel computation time defined in the previous paragraph is different from
parallel computation cost defined in Section 5.1.1.1. Being able to take into account processor
waiting  time  induced  by  inter-node  dependency,  parallel  computation  time  may  be  of  more 
practical  interest  than  parallel  computation  cost. 

However,  to  prove  Theorem  5.4,  we  need  to  establish  an  upper  bound  on  the  parallel 
computation  cost  of  the  algorithm.  We  will  do  this  and  also  establish  an  upper  bound  on  the 
communication  cost  of  the  algorithm. 


Figure  5.1:  At  most  d  frontier  nodes  at  each  level  on  a  processor  (d  =  3). 


Proof of Theorem 5.4. To achieve the $\lceil N/p \rceil$ upper bound on parallel computation
cost, we will need to add a fair scheduling feature to the algorithm described
above. Whenever the number of nodes in the GP is smaller than the number of
idle processors, we will select the active processors for the next time step from
all the p processors in a fair way. That is, processors take turns to become active
using a round-robin scheme. This ensures that, at the end of any time step, the total
number of nodes expanded by a processor so far will not exceed that expanded by
any other processor by more than one. Thus when all the N nodes are expanded,
each processor will have expanded at most $\lceil N/p \rceil$ nodes. This proves that the parallel
computation cost of the scheduling algorithm with the fair scheduling feature is at
most $\lceil N/p \rceil$.

The communication cost of the algorithm is at most the number of frontier
nodes entering the GP, as this represents the only interprocessor communication
activity for the entire algorithm. Since by using depth-first search each processor
has at most d local nodes at each level (as illustrated in Figure 5.1), the GP can
collect at most pd nodes each time that gl increases. This will happen at most h
times, so the total number of nodes entering the GP is bounded above by $C_u = pdh$.

□


Note that in a practical implementation, the fair scheduling feature may not be used since
minimizing parallel computation cost may not be important. Without the fair scheduling feature,
the parallel computation cost would become $\lceil N/p + h \rceil$. However, the communication cost can
be reduced to $p(d-1)h$, if a processor right after expanding a node will schedule one child, if
any, of the node for expansion at the next time step.

The  scheduling  algorithm  described  in  this  section  is  being  used  as  a  basis  for  developing 
a  parallel  programming  model  for  D&C  computations.  To  obtain  practical  insights,  we  plan  to 
implement  a  programming  system  based  on  the  model  on  the  26-host  Nectar  network  system 
[6]  developed  at  Carnegie  Mellon  University. 
5.1.3 A Simplified Version of Theorem 5.1


This section presents Theorem 5.5 (see below), which is a simplified version of Theorem 5.1
dealing with only two processors. A relatively simple proof of Theorem 5.5 is given. This
simple proof captures the essence of the more complicated proof of Theorem 5.1 given in Section
5.1.4. The reader is advised to read this simple proof first to understand the ideas.


Theorem 5.5 For each scheduling algorithm A for a parallel system of two processors,
for each N, h, and d with the following three restrictions,

S1. $N > 3dh$,

S2. $h > \lceil \log_d N \rceil + 2$, and

S3. $h - \lceil \log_d N \rceil - 2$ is an even integer,

there exists some (N, h, d)-tree H for which at least one of the following two
properties is true:

T1. the parallel computation cost of the algorithm is $T_A(H) \ge N - 3dh$;

T2. the communication cost of the algorithm is $C_A(H) \ge h'(d-1)$,

where $h' = (h - \lceil \log_d N \rceil - 2)/2$.


Note that restrictions S1 and S2 correspond to those in Theorem 5.1. Restriction S3 is for
a minor technical convenience, namely, ensuring that h' is an integer.

Theorem 5.5 implies, for example, that if the communication cost is small (in the sense that
T2 does not hold), then the parallel computation cost must be large (in the sense that T1 holds).
In particular, if $C_A(H) < h'(d-1)$ and if $3dh \ll N$, then the parallel computation cost will
be close to N.
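
As a rough numerical illustration of Theorem 5.5, the sketch below checks restrictions S1 to S3 and evaluates the two thresholds in T1 and T2. The function names are hypothetical, and the ceiling of the logarithm is computed with integer arithmetic.

```python
def ceil_log(base: int, x: int) -> int:
    """Smallest integer e >= 0 with base**e >= x (exact integer arithmetic)."""
    e, power = 0, 1
    while power < x:
        power *= base
        e += 1
    return e


def theorem_5_5_bounds(N: int, h: int, d: int):
    """Return (T1 threshold, T2 threshold) if S1, S2 and S3 hold, otherwise None."""
    log_n = ceil_log(d, N)
    s1 = N > 3 * d * h
    s2 = h > log_n + 2
    s3 = (h - log_n - 2) % 2 == 0
    if not (s1 and s2 and s3):
        return None
    h_prime = (h - log_n - 2) // 2            # h'
    return N - 3 * d * h, h_prime * (d - 1)   # N - 3dh and h'(d - 1)


if __name__ == "__main__":
    print(theorem_5_5_bounds(1000, 32, 2))
```

For N = 1000, h = 32 and d = 2 the theorem then says that some tree forces either a parallel computation cost of at least 808 or a communication cost of at least 10.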


Proof  of  Theorem  5.5 
Suppose that we are given a scheduling algorithm A for performing a D&C computation on
processors $P_1$ and $P_2$. For algorithm A, we will prove the existence of an (N, h, d)-tree H for
which at least one of T1 and T2 must hold.

By  playing  an  adversary  game  with  algorithm  A,  we  will  construct  the  tree  by  growing  it 
from  the  root  one  step  at  a  time.  A  time  step  consists  of  two  phases,  node  scheduling  phase  and 
node  expansion  phase.  In  the  node  scheduling  phase,  algorithm  A  schedules  a  node  or  no  node 
for  each  processor  to  execute.  Then,  in  the  node  expansion  phase,  these  scheduled  nodes  are 
expanded.  In  this  phase  we  will  determine  the  number  of  children  each  scheduled  node  will 
generate. 

We will first define a special class of subtrees which will be used to describe some sufficient
conditions under which a tree can grow to an (N, h, d)-tree. We will then give the main part of
the proof including a description of the tree construction procedure.


HFD-Subtree 


Definition 5.1 At any given time during the tree construction, a High-and-Full-Degree
subtree (abbr. HFD-subtree) is a subtree which is rooted at a node at
or above level $h - \lceil \log_d N \rceil$, and which has been constructed using the following
rules:

A1. nodes above level h generate d children; and
A2. nodes at level h generate no children.


Note that rules A1 and A2 imply that a node which is above level h and has no children must
be a frontier node.


Lemma  5.1  At  any  given  time  during  the  tree  construction,  if  the  current  tree 
satisfies  the  following  four  properties: 
I1. the total number of generated nodes is at most $N - h - d$ (generated nodes
include the root);

I2. the height is at most h;

I3. the degree of any node is at most d; and

I4. the tree contains an HFD-subtree,

then a construction procedure can be devised to grow the tree to an (N, h, d)-tree.


Proof. We first note that in the HFD-subtree of I4 there exist nodes which are
above level h and have no children. Otherwise, the subtree would have been “fully
grown” to level h, according to rules A1 and A2. Since its root is at or above level
$h - \lceil \log_d N \rceil$, this fully grown HFD-subtree would have at least $d^{\lceil \log_d N \rceil}$ ($\ge N$)
nodes. This contradicts I1. As noted above, those nodes in the current HFD-subtree
which are above level h and have no children must all be frontier nodes.

Let $H_1$ be the current tree. We will identify a set of “padding nodes” which
can be added to $H_1$ to make it an (N, h, d)-tree.


Figure 5.2: Growing the current tree to an (N, h, d)-tree.

If $H_1$ has height less than h or degree less than d, we will grow it by extending
the current HFD-subtree from one of its frontier nodes which are above level h. Let
v be this frontier node, as shown in Figure 5.2. We generate d children for v and
create a path from v to a node at level h, as shown in Figure 5.2 (a). The resulting
tree, called $H_2$, has height h, degree d, and no more than $(N - h - d) + d + h = N$
nodes.

If $H_2$ has fewer than N nodes, we will pad it with nodes in the fully grown HFD-subtree
which are reachable from the current frontier nodes and other padding
nodes, as illustrated in Figure 5.2 (b). Since the fully grown HFD-subtree has
at least N nodes, it has sufficient nodes which can be added to $H_2$ to make it an
(N, h, d)-tree.

After having identified all these padding nodes, we now have a “blueprint” for
a construction procedure to follow. More precisely, the construction procedure will
just generate all those padding nodes in the dark region in Figure 5.2 (b). □


Main Part of Proof of Theorem 5.5

The  tree  construction  procedure  consists  of  three  stages.  Each  stage  uses  an  independent 
set  of  rules  in  constructing  the  tree. 


Figure 5.3: Two areas in the constructed tree. (Region labels in the figure: $2h' + 2$ and $\lceil \log_d N \rceil$.)


In stage 1, we expand each node with exactly d children. Stage 1 terminates at time $T_1$
when a total of $2h'$ or $2h' + 1$ nodes have just been expanded. (Note that at this time the tree is
completely inside area 1 of Figure 5.3.) Since the number of frontier nodes increases by $d - 1$
each time a node is expanded, there are exactly $2h'(d-1) + 1$ or $(2h'+1)(d-1) + 1$
frontier nodes at time $T_1$. Without loss of generality, we assume that processor $P_1$ has generated
at least $h'(d-1)$ frontier nodes.
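
The frontier count used in stage 1 can be checked with a few lines of arithmetic. The sketch below (hypothetical names) replays m expansions with exactly d children each and then applies the pigeonhole step: with two processors, one of them must have generated at least half of the frontier nodes.

```python
def frontier_after(m: int, d: int) -> int:
    """Frontier size after m expansions, each producing exactly d children."""
    frontier = 1                      # initially only the root is a frontier node
    for _ in range(m):
        frontier += d - 1             # the expanded node leaves, its d children join
    return frontier


def larger_share(frontier: int) -> int:
    """Minimum number of frontier nodes generated by the busier of two processors."""
    return -(-frontier // 2)          # ceiling of frontier / 2


if __name__ == "__main__":
    h_prime, d = 5, 3
    f = frontier_after(2 * h_prime, d)
    print(f, larger_share(f), h_prime * (d - 1))
```

The pigeonhole step assumes at least one expansion has taken place, so that every frontier node was generated by one of the two processors.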

Stage 2 starts right after $T_1$. In this stage every node above level h expanded by processor
$P_1$ will have d children, while every node at level h or expanded by processor $P_2$ will have
no children. Stage 2 terminates at time $T_2$ when one of the following two conditions becomes
true:


C1. At least $h'(d-1)$ cross nodes have been scheduled.
C2. At least $N - h - 2d$ nodes have been generated.
The following shows that C1 or C2 must become true sometime, i.e., $T_2$ exists. Recall that
by the end of stage 1 processor $P_1$ has generated at least $h'(d-1)$ frontier nodes. In stage
2 processor $P_1$ will generate nodes in the subtrees rooted at those frontier nodes which are
still in $P_1$. For each of these subtrees, since its root is in area 1 of Figure 5.3, the subtree can
have at least $N - h - 2d$ nodes unless some of these nodes are moved to processor $P_2$ from
processor $P_1$. If C1 does not hold, then fewer than $h'(d-1)$ nodes can be moved from $P_1$ to
$P_2$. Consequently, some subtree will have at least $N - h - 2d$ nodes, and thus C2 will be true.

Stage 3 starts right after time $T_2$. Lemma 5.2 below shows that properties I1-I4 of Lemma
5.1 hold for the tree at time $T_2$. In stage 3, we follow the procedure described in the proof of
Lemma 5.1 to grow the tree to an (N, h, d)-tree.


Lemma 5.2 At any time in stage 1 or 2, including time $T_2$, the tree satisfies
properties I1-I4 of Lemma 5.1.


Proof. It is obvious from the descriptions of stages 1 and 2 that I2 and I3 are
satisfied. For I1, we note that the total number of nodes generated in stage 1 is at
most $(2h'+1)d + 1$, and thus at most $N - h - d$ by restriction S1 of Theorem 5.5.
In stage 2, I1 obviously holds when C2 is not true. Suppose that C2 becomes true
at time $T_2$. Since the tree has no more than $N - h - 2d$ nodes in the previous time
step and since at most d nodes can be generated (in processor $P_1$) in one time step,
there are at most $N - h - d$ nodes at time $T_2$.

Property I4 clearly holds for stage 1 by examining its description. It remains to
prove that I4 holds for stage 2. The proof is similar to the earlier proof of the fact
that C1 or C2 must become true in stage 2. Recall that in stage 1 processor $P_1$ has
generated at least $h'(d-1)$ frontier nodes. We note that any of the subtrees rooted
at these nodes is an HFD-subtree if the subtree does not contain any expanded cross
node. Since the number of cross nodes expanded (not just scheduled) through time
$T_2$ is less than $h'(d-1)$, one of these subtrees must be an HFD-subtree. Note that
if C1 becomes true at time $T_2$ (in the node scheduling phase), the node scheduled
has not been expanded. □
To complete the proof of Theorem 5.5, we observe that if C1 becomes true at some time in
stage 2 or 3, it will remain true for the rest of the tree construction process. Therefore property
T2 of Theorem 5.5 will hold for the final (N, h, d)-tree.

Now assuming that C1 never holds at any time in stage 2 or 3, we want to show that property
T1 of Theorem 5.5 will hold for the final (N, h, d)-tree. We derive an upper bound on the total
number of nodes expanded by processor $P_2$. The upper bound is the sum of four terms $U_1$, $U_2$,
$U_3$ and $U_4$. In stage 1, processor $P_2$ has expanded at most $U_1 = 2h' + 1$ nodes. At time $T_1$,
processor $P_2$ can have generated up to $U_2 = (h'+1)(d-1) + 1$ frontier nodes, each of which can be
expanded at most once by processor $P_2$ in stage 2 or 3. It is also possible for processor $P_2$ to
expand nodes which are generated by $P_1$ but subsequently moved to $P_2$. The total number of
these nodes is at most $C_A(H) < U_3 = h'(d-1)$. Moreover, to take care of the nodes generated
after $T_2$ in stage 3, processor $P_2$ may expand up to $U_4 \le h + 2d$ nodes. Therefore the total
number of nodes expanded by processor $P_2$ is at most $U = U_1 + U_2 + U_3 + U_4 \le 3dh$. This
implies that processor $P_1$ has expanded at least $N - U \ge N - 3dh$; that is, property T1 holds.

□
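
The final inequality $U \le 3dh$ can be sanity-checked numerically. The sketch below sums the four terms as read from the proof (the labelling of the second term as $U_2 = (h'+1)(d-1)+1$ is an assumption of this sketch) and compares the total with 3dh over a range of parameters that satisfy restrictions S2 and S3. Function names are hypothetical.

```python
def ceil_log(base: int, x: int) -> int:
    """Smallest integer e >= 0 with base**e >= x (exact integer arithmetic)."""
    e, power = 0, 1
    while power < x:
        power *= base
        e += 1
    return e


def p2_expansion_bound(N: int, h: int, d: int) -> int:
    """Upper bound U = U1 + U2 + U3 + U4 on the nodes expanded by processor P2."""
    h_prime = (h - ceil_log(d, N) - 2) // 2
    u1 = 2 * h_prime + 1                  # nodes expanded by P2 in stage 1
    u2 = (h_prime + 1) * (d - 1) + 1      # frontier nodes generated by P2 by time T1
    u3 = h_prime * (d - 1)                # cross nodes moved from P1 to P2
    u4 = h + 2 * d                        # nodes expanded after T2 in stage 3
    return u1 + u2 + u3 + u4


if __name__ == "__main__":
    N, h, d = 1000, 32, 2
    print(p2_expansion_bound(N, h, d), 3 * d * h)
```

Algebraically the sum equals $2h'd + 3d + h + 1$, which stays below 3dh once $2h' \le h - 3$ and $d \ge 2$.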


5.1.4  Proof  of  Theorem  5.1 


Suppose that we are given a scheduling algorithm A for performing a D&C computation on a
parallel system of p processors. For algorithm A, we will prove the existence of an (N, h, d)-tree
H for which either only p' processors are active in expanding most of the nodes (at least N' nodes)
or at least C nodes are moved between processors to balance their computation loads. For the
former, the parallel computation cost will be high, i.e., $T_A(H) \ge N'/p'$ (property Q1). For the
latter, the number of cross nodes will be large, i.e., $C_A(H) \ge C$ (property Q2).

By playing an adversary game with algorithm A, we will construct the tree by growing it
from the root one step at a time. The definition of time step is the same as that in the proof of
Theorem 5.5.

We will give some more definitions in Section 5.1.4.1 and then give the main part of this
proof in Section 5.1.4.2. All the related lemmas are in Section 5.1.4.3.
5.1.4.1 Definitions


To  help  derive  a  lower  bound  on  the  number  of  cross  nodes,  we  introduce  the  following  relation 
between  subtrees. 


Definition 5.2 A set of subtrees is processor-or-ancestry independent (abbr. PA-independent)
if for each pair of subtrees in the set at least one of the following two
properties  is  satisfied: 

1.  Processor  Independence:  the  roots  of  these  two  subtrees  are  generated  on 
different  processors; 

2.  Ancestry  Independence:  neither  is  a  subtree  of  the  other.  That  is,  there  is  no 
ancestor-descendant  relationship  between  the  two  roots. 


Note that for two PA-independent subtrees rooted at nodes $r_1$ and $r_2$, if node $r_1$ is an ancestor
of node $r_2$, then the two nodes must be generated on different processors. This implies that there
must exist at least one cross node on the path from node $r_1$ (inclusive) to the parent (inclusive)
of node $r_2$. Therefore, from this property, if there are k PA-independent subtrees each of which
has at least one expanded cross node, then there are at least k expanded cross nodes in the tree.
This is shown in Lemma 5.3 (in Section 5.1.4.3).
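
Definition 5.2 can be read as a pairwise test on subtree roots. The following sketch checks it directly; the data layout is an assumption made for illustration (a `parent` map from each node to its parent, with `None` for the root, and a `generated_on` map from each node to the processor that generated it), and the function names are hypothetical.

```python
def is_ancestor(parent, a, b):
    """True if node a is a proper ancestor of node b."""
    v = parent.get(b)
    while v is not None:
        if v == a:
            return True
        v = parent.get(v)
    return False


def pa_independent(roots, parent, generated_on):
    """True if every pair of subtree roots is processor- or ancestry-independent."""
    for i, r1 in enumerate(roots):
        for r2 in roots[i + 1:]:
            processor_indep = generated_on[r1] != generated_on[r2]
            ancestry_indep = not (is_ancestor(parent, r1, r2)
                                  or is_ancestor(parent, r2, r1))
            if not (processor_indep or ancestry_indep):
                return False
    return True


if __name__ == "__main__":
    # Tree: 0 -> {1, 2}, 1 -> {3}; node 3 was generated on processor 1.
    parent = {0: None, 1: 0, 2: 0, 3: 1}
    generated_on = {0: 0, 1: 0, 2: 0, 3: 1}
    print(pa_independent([1, 2], parent, generated_on))
    print(pa_independent([1, 3], parent, generated_on))
    print(pa_independent([0, 1], parent, generated_on))
```

In the small example, roots 1 and 3 are in an ancestor-descendant relationship but were generated on different processors, which is the case that forces a cross node on the path between them.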


Definition 5.3 An HFDC-subtree is an HFD-subtree (as defined in Definition 5.1)
or a subtree with at least one cross node already expanded. If the root of an HFDC-subtree
is generated on processor P, the subtree is called an HFDC-subtree on
processor P.


By  Lemma  5.3  and  Definition  5.3,  if  there  are  k  PA-independent  HFDC-subtrees  and  fewer 
than  k  expanded  cross  nodes,  then  there  exists  an  HFD-subtree,  as  shown  in  Lemma  5.4.  We 
will  use  this  lemma  to  show  the  existence  of  an  HFD-subtree  during  some  periods  of  the  tree 
construction procedure.
Stage 1 =>

Apply the following four rules:

R1. Nodes in area 1 (shown in Figure 5.5) will generate d children.

R2. Cross nodes in areas 2 and 3 (shown in Figure 5.5) will not generate any children.

R3. Non-cross nodes in areas 2 and 3 (excluding level h) will generate d children.

R4. Nodes at level h will not generate any children.

Repeat rules R1-R4 until time $T_1$ when any of the following three conditions holds:

C1. For some p' processors, at least h' non-cross nodes have been expanded on each processor.

C2. At least C cross nodes have been scheduled.

C3. At least $N - (pd + d + h)$ nodes have been generated.

Stage 2 (continued from time $T_1$ when C1 holds) =>

Find a set $\Gamma$ of p' processors with the following two properties:

B1. There are at least C PA-independent HFDC-subtrees in $\Gamma$.

B2. There are at most h' non-cross nodes expanded on each of the other $p - p'$ processors
in the set $\bar{\Gamma}$.

Apply the following three rules:

R5. Nodes (excluding those at level h) in $\Gamma$ will generate d children.

R6. Nodes in $\bar{\Gamma}$ will not generate any children.

R7. Nodes at level h will not generate any children.

Repeat rules R5-R7 until time $T_2$ when either of the following two conditions holds:

C4. At least C cross nodes have been scheduled.

C5. At least $N - (pd + d + h)$ nodes have been generated.

Stage 3 (continued from time $T_1$ when C2 or C3 holds or from time $T_2$ when C4 or C5 holds) =>

Use the construction procedure described in the proof of Lemma 5.1 to grow the tree
to an (N, h, d)-tree.


Figure 5.4: Tree construction procedure.

Figure 5.5: Three areas in the constructed tree.
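
The rules of the construction procedure can be read as two small decision functions that the adversary applies to each node being expanded. The sketch below is one such reading, with hypothetical names. Since the area boundaries come from Figure 5.5, the area index is passed in as an input, and `in_gamma` is assumed to mean that the node is expanded on a processor belonging to the chosen set of p' processors.

```python
def children_in_stage1(area: int, is_cross: bool, level: int, h: int, d: int) -> int:
    """Number of children the adversary gives an expanded node in stage 1."""
    if level == h:
        return 0          # R4: nodes at level h generate no children
    if area == 1:
        return d          # R1: nodes in area 1 generate d children
    if is_cross:
        return 0          # R2: cross nodes in areas 2 and 3 generate no children
    return d              # R3: non-cross nodes in areas 2 and 3 generate d children


def children_in_stage2(in_gamma: bool, level: int, h: int, d: int) -> int:
    """Number of children the adversary gives an expanded node in stage 2."""
    if level == h:
        return 0          # R7: nodes at level h generate no children
    return d if in_gamma else 0   # R5 for the chosen processors, R6 for the others


if __name__ == "__main__":
    print(children_in_stage1(area=2, is_cross=True, level=5, h=10, d=4))
    print(children_in_stage2(in_gamma=True, level=5, h=10, d=4))
```

Checking the level-h rule first keeps the four stage 1 rules mutually exclusive, which is how they are used in the proof.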
5.1.4.2 Main Part of Proof of Theorem 5.1


The tree construction procedure consists of three stages. Basically, this procedure, summarized
in Figure 5.4, is similar to that in Section 5.1.3. The main difference is that in stage 1 of this
procedure we use more sophisticated rules to prove a better lower bound on the number of
cross nodes. (Note that if $h \gg \log_d N$, $p = 2$, and $p' = 1$, the lower bound on communication
cost in this theorem is approximately twice as large as that in Theorem 5.5.)

In stage 1, we will repeatedly apply rules R1-R4 (in Figure 5.4) until time $T_1$ when one of the conditions C1-C3 holds. Rules R1-R4 ensure that each subtree rooted in area 1 or 2 is always an HFDC-subtree, because in constructing the subtree either rules A1 and A2 are followed (using R1, R3, and R4) or some cross nodes are expanded (using R2). Basically, the procedure in stage 1 attempts to produce at least $C'$ PA-independent HFDC-subtrees on some $p'$ processors (property B1) while preventing each of the other $p - p'$ processors from expanding more than $h'$ non-cross nodes (property B2). (Recall that in the proof of Theorem 5.5 subtrees rooted at frontier nodes at time $T_1$ are PA-independent HFDC-subtrees.)


Figure 5.6: Around the time when condition C1 becomes true.


If condition C1 holds at time $T_1$, then from Figure 5.6 we can find a set $\Gamma$ of $p'$ processors for which condition C1 and property B2 hold. According to Lemma 5.5, there are at least $k$ $(= (d - 1)h')$ PA-independent HFDC-subtrees on each processor which has expanded $h'$ non-cross nodes. So there are at least $C'$ $(= p'k)$ PA-independent HFDC-subtrees in $\Gamma$ at this time. Therefore, property B1 holds, and we are ready for stage 2.

In stage 2, we will repeatedly apply rules R5-R7 until time $T_2$ when condition C4 or C5 holds. (Note that these rules are exactly the same as those of stage 2 in Section 5.1.3.) According to property B1, initially, there are at least $C'$ PA-independent HFDC-subtrees in $\Gamma$. In stage 2, these subtrees continue to be HFDC-subtrees, because either rules A1 and A2 are followed (using R5 and R7) or some cross nodes are expanded (using R6). In addition, by rule R6, the set $\bar{\Gamma}$ of the other $p - p'$ processors will not generate any new nodes.

Now, we want to show that one of the conditions C2-C5 must become true at time $T_1$ or $T_2$. According to Lemma 5.6 (in Section 5.1.4.3), at any time in stage 1 or 2 properties I1-I4 of Lemma 5.1 hold; so, at any time in stage 1 or 2 the tree will be able to grow to a $(N, h, d)$-tree by Lemma 5.1. Hence, if neither C2 nor C4 ever holds, C3 or C5 becomes true.

Stage 3 starts right after one of the conditions C2-C5 becomes true. (If C2 or C3 holds at $T_1$, this implies that stage 2 is empty.) Since Lemma 5.6 also shows that properties I1-I4 of Lemma 5.1 hold for the tree at time $T_1$ or $T_2$, in stage 3 we will follow the procedure described in the proof of Lemma 5.1 to grow the tree to a $(N, h, d)$-tree $H$.

To  complete  the  proof,  we  observe  that  if  Ca(H)  >  C  it  will  remain  true  for  the  rest  of  the 
tree  construction  process.  Therefore  property  Q2  of  Theorem  5.5  will  hold  for  H. 

Now, assuming that $C_a(H) < C'$, we want to prove that property Q1 holds for $H$. Since C2 and C4 never hold, either C3 will become true at time $T_1$ or C5 will become true at time $T_2$. First, suppose that condition C5 becomes true at time $T_2$. To prove that property Q1 holds in this case, we will derive an upper bound on the total number of nodes expanded in $\bar{\Gamma}$. The upper bound consists of five terms $U_1$, $U_2$, $U_3$, $U_4$, and $U_5$. Assume that there are $C_1 \le U_1 = C'$ cross nodes expanded in $\bar{\Gamma}$ in stage 1. In stage 1, the processors in $\bar{\Gamma}$ have expanded at most $U_2 = (p - p')h'$ non-cross nodes due to property B2. These nodes expanded in stage 1 will generate at most $U_3 = ((p - p')h' + C_1)d$ frontier nodes in $\bar{\Gamma}$ at time $T_1$, each of which can be expanded at most once in $\bar{\Gamma}$. After time $T_1$, it is also possible for the processors in $\bar{\Gamma}$ to expand nodes moved from the processors in $\Gamma$. The total number of these nodes is $U_4 \le C_a(H) - C_1$. Moreover, to take care of the nodes generated after $T_2$, processors in $\bar{\Gamma}$ may expand up to $U_5 \le pd + d + h$ nodes. Therefore, the total number of nodes expanded in $\bar{\Gamma}$ is at most $U = U_1 + U_2 + U_3 + U_4 + U_5 \le 3pd^2h$. This implies that the processors in $\Gamma$ have expanded at least $N - U \ge N - 3pd^2h$ nodes; therefore, $T_a(H) \ge (N - 3pd^2h)/p' \ge N'/p'$, i.e., property Q1 holds.

Suppose that condition C3 becomes true at time $T_1$. Since condition C1 does not hold in stage 1, we can find a set $\Gamma$ of $p'$ processors with property B2 (see Figure 5.6 also). Since stage 2 is empty for this case, we can let time $T_2$ be the same as $T_1$. Thus, we can use the same technique as above to prove that property Q1 holds. []


5.1.4.3 Relevant Lemmas


Lemma 5.3 Suppose that there are k PA-independent subtrees at some time during
the  computation.  If  each  of  these  subtrees  has  at  least  one  expanded  cross  node, 
then  the  total  number  of  expanded  cross  nodes  in  the  whole  tree  constructed  so  far 
is  at  least  k. 


Proof.  This  proof  is  not  trivial  because  among  these  subtrees  those  with  ancestry 
relationship  may  contain  the  same  expanded  cross  node. 


Figure  5.7:  Expanded  cross  nodes  corresponding  to  PA-independent  subtrees. 

In this proof, we will prune the k PA-independent subtrees one by one under the restriction that the subtree being pruned contains no other subtrees which have not been pruned yet. (For the example illustrated in Figure 5.7, we can prune the subtrees in the order $T_4$, $T_3$, $T_2$, and $T_1$.) For this proof, it suffices to prove that each pruned subtree has at least one expanded cross node.

Initially, the first pruned subtree obviously has at least one expanded cross node by the assumption of the lemma. As mentioned in Section 5.1.4.1, for any two PA-independent subtrees $T$ and $T'$ rooted at nodes $r$ and $r'$ respectively, if $r$ is an ancestor of $r'$, there must exist at least one expanded cross node on the path from $r$ (inclusive) to the parent (inclusive) of $r'$ due to processor independence. Therefore, if we prune $T'$ at $r'$, $T$ still has at least one expanded cross node. Hence, after we prune each subtree under the above restriction, each of the remaining subtrees will still have at least one expanded cross node. This implies that the next pruned subtree also has at least one expanded cross node. So, each pruned subtree has at least one expanded cross node. []

Figure 5.8: In stage 1, any non-cross node's ancestors in area 2 must have been generated on the same processor.


Lemma  5.4  At  some  time,  if  there  are  k  PA-independent  HFDC-subtrees  and  fewer 
than  k  expanded  cross  nodes,  there  exists  an  HFD-subtree. 


Proof. Assume that there exists no HFD-subtree. Thus, each of these PA-independent HFDC-subtrees has at least one expanded cross node according to the definition of HFDC-subtree. By Lemma 5.3, there are at least k expanded cross nodes. This contradicts the assumption of the lemma. []


Lemma 5.5 In stage 1, if a processor has expanded h' non-cross nodes, then there
are at least k PA-independent HFDC-subtrees on the processor.

Proof. As mentioned in Section 5.1.4.2, each subtree rooted in area 1 or 2 is always
an HFDC-subtree in stage 1. Thus it suffices to prove that at least k nodes with
ancestry independence in areas 1 and 2 will be generated on the processor after h'
non-cross nodes have been expanded. By rules R1-R3, for any non-cross node, all
of  its  ancestors  in  area  2  (with  h'  1  levels)  must  be  non-cross  nodes  as  shown  in 
Figure  5.8.  So,  all  the  nodes  generated  by  the  first  h'  non-cross  nodes  must  be  in 
areas 1 and 2. Since each of the h' non-cross nodes will generate d children and
can remove at most one ancestor, these non-cross nodes will, in total, generate at
least $(d - 1)h'$ $(= k)$ nodes with ancestry independence. []


Lemma 5.6 At any time in stage 1 or 2, including time $T_1$ or $T_2$, the tree satisfies properties I1-I4 of Lemma 5.1.

Proof. It is obvious from rules R1-R7 that I2 and I3 are satisfied. In addition, it is also obvious that I1 holds before condition C3 or C5 becomes true. Consider the first time step when at least $N - (pd + d + h)$ nodes have been generated (i.e., condition C3 or C5 holds). Since the tree has no more than $N - (pd + d + h)$ nodes in the previous time step and since at most $pd$ nodes will be generated in each time step, there are at most $N - h - d$ nodes in the current time step. In the rest of this proof, we will show that I4 always holds (i.e., there always exists an HFD-subtree) in each stage.

In stage 1, all the nodes in area 1 will generate $d$ nodes by rule R1. So, before all the nodes in area 1 have been expanded, there must exist one frontier node in area 1, of which the subtree (with only one node) is an HFD-subtree. After all the nodes in area 1 are expanded, there are at least $pdh \ge C'$ subtrees rooted at the top level of area 2. Obviously, these subtrees are PA-independent. They are also HFDC-subtrees because each subtree rooted in area 1 or 2 in stage 1 is always an HFDC-subtree as described in Section 5.1.4.2. Since the number of expanded cross nodes is always less than $C'$ (due to condition C2), there has always been an HFD-subtree up to time $T_1$ by Lemma 5.4. Thus, we can conclude that there always exists an HFD-subtree in stage 1.

In stage 2, initially, there are at least $C'$ PA-independent HFDC-subtrees in $\Gamma$ (property B1). These subtrees will continue to be HFDC-subtrees in this stage as described in Section 5.1.4.2. In stage 2, due to condition C4 the number of expanded cross nodes is always less than $C'$; so, there always exists an HFD-subtree by Lemma 5.4. []

5.2  Parallel  Range  Selection 

Selection  [17,  28,  42]  is  a  very  common  operation,  which  we  define  as  follows. 

Given a set of $N$ elements each containing a key, and given an integer value $M$, $1 \le M \le N$, select $M$ elements that have the smallest key values.¹

(Note  that  when  applied  to  our  multilist  scheduling  system,  elements  are  equivalent  to  tasks 
and  keys  are  equivalent  to  priorities.) 

For the parallel selection problem, some efficient algorithms have been designed by other researchers [16, 77], but they usually do not try to minimize the number of critical-path sends/receives.² For example, we estimate that the parallel selection algorithm [16] (a straightforward parallel design of [42]) requires an average of $O(\log p \log N)$ critical-path sends/receives. However, in a network-based multicomputer, we want to minimize the number of critical-path sends/receives while using moderately large packets, because a network-based multicomputer has the following characteristics. First, each send/receive incurs a significant amount of overhead, e.g., a couple of milliseconds over Ethernet or about 200 microseconds [27] on Nectar, as opposed to tens of nanoseconds per instruction. Second, a packet with moderate size does not incur a significant amount of overhead. For example, sending a packet of a few kilobytes takes only a small number of times longer than sending a single word.

If we wanted to reduce the number of critical-path sends/receives to $O(\log p)$ while using packets of unlimited size, we could use a naive algorithm in which each processor sends a packet of $M$ smallest key values in order to construct the set of $M$ smallest key values of the entire system. Since $M$ is independent of the number of processors and can be an extremely large number, this algorithm may have poor performance. But if we limited the packet size to a moderate size (say, a polynomial function of $\log p$), it would be very difficult to reduce the number of critical-path sends/receives to $O(\log p)$.

¹ Sometimes, the definition of selection is to select only the element with the $M$-th smallest key. However, a selection algorithm which can find the $M$-th smallest key $\pi$ can usually identify the $M$ elements with priorities $\le \pi$.

² Here, we view a "path" as a sequence of sends/receives performed on any corresponding sequence of processors. The path with the largest number of sends/receives will be called the critical path. Sends/receives on the critical path are called critical-path sends/receives. The time for a computation is at least the number of critical-path sends/receives times the average time for each send/receive.

Fortunately, as explained in Section 4.1.2.2, in our multilist scheduling system, it is not critical for the selection problem to select exactly $M$ elements. It is good enough to select $\Theta(M)$ elements. Thus, we can relax the selection problem to the following:

Given a set of $N$ elements each containing a key, and given an integer value $M$, $1 \le M \le N$, select $M_{set}$ elements that have the smallest key values, where $M_{set} = \Theta(M)$, e.g., $M \le M_{set} \le 2M$.

This problem is called the range selection problem because the value $M_{set}$ is in a range $\Theta(M)$, not just a fixed value, $M$.
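
As an illustration only, the relaxed problem can be read as a threshold test: choose a key threshold and accept it if the number of elements at or below it falls in the example range from $M$ to $2M$ given above. The following Python sketch assumes distinct keys kept in ascending order; the function names and the `upper_factor` parameter are hypothetical and not part of the thesis.

```python
from bisect import bisect_right

def count_at_most(sorted_keys, threshold):
    """Number of keys <= threshold (sorted_keys must be ascending)."""
    return bisect_right(sorted_keys, threshold)

def is_valid_range_selection(sorted_keys, threshold, M, upper_factor=2):
    """True if selecting every key <= threshold picks between M and
    upper_factor * M elements (the example range from the text)."""
    selected = count_at_most(sorted_keys, threshold)
    return M <= selected <= upper_factor * M

# Hypothetical data: 100 distinct keys 1..100, request M = 10.
keys = list(range(1, 101))
print(is_valid_range_selection(keys, 15, 10))
```

Any threshold whose count lands in the accepted range is a valid answer, which is the freedom the PRS algorithm exploits.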

By taking advantage of this relaxation, we can devise a very efficient parallel range selection (PRS) algorithm which minimizes the number of critical-path sends/receives (by using only one combining operation and then one disseminating operation, as defined in Definition 4.1), while keeping the packet size moderate ($O(\log^2 p)$). First, we will summarize our theoretical results concerning the PRS algorithm, and then we will present more details.


5.2.1  Summary  of  Results 

5.2.1.1 Assumptions

Our new PRS algorithm is based on a tree-shaped network of processors. For simplicity of discussion, we make two assumptions about the network as follows.

• The processor tree is a complete binary tree. This implies that the leaf processors are all at the same bottom level $q$ of the processor tree, where $q = \lg p$ and $p$ is the number of leaf processors.

• All elements are distributed over leaf processors, not over internal processor nodes. For example, if internal processor nodes are embedded in leaf processors, as in the example in Figure 4.4, we can say the internal processor nodes have no elements.


Without loss of generality, we also assume all the elements have distinct key values for the range selection problem. If elements are allowed to have the same key values, we can redefine the key value in order to make each element have a distinct key value, as follows. For an element on a leaf processor $P_i$, if the original key value is $\pi$ and the element is the $j$-th element with the key value $\pi$ on $P_i$, we can use a compound key $(\pi, i, j)$ as its new key: $\pi$ is the primary key, $i$ is the secondary key, and $j$ is the tertiary key.
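
A minimal sketch of this compound-key construction, assuming each leaf is identified by an integer index and holds a plain list of (possibly repeated) key values. The function name and the dictionary layout are hypothetical. Python tuples compare lexicographically, which matches the primary, secondary, tertiary ordering described above.

```python
def compound_keys(leaf_elements):
    """leaf_elements: dict mapping leaf index i -> list of original key values on P_i.
    Returns (pi, i, j) tuples, where j numbers the repeats of pi on leaf i from 1."""
    result = []
    for i, keys in leaf_elements.items():
        seen = {}  # how many times each key value has appeared on this leaf so far
        for pi in keys:
            seen[pi] = seen.get(pi, 0) + 1
            result.append((pi, i, seen[pi]))
    return result

# Hypothetical data: key 5 appears twice on leaf 0 and once on leaf 1.
keys = compound_keys({0: [5, 5, 7], 1: [5, 9]})
print(sorted(keys))
```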


5.2.1.2 Main Result


Theorem 5.6 Given a processor tree with the above assumptions, an algorithm can be devised to solve the PRS problem and to satisfy the following properties.

U1 The algorithm only requires one combining operation and then one disseminating operation (see Definition 4.1).

U2 Each packet size is $O(\lg^2 p)$. More accurately, if $p \ge 2$, each packet has at most $\lceil \kappa \lg^2 p \rceil + 1$ items each containing one load value (representing an element count) and one key value, where $\kappa = 1/(1 - \delta/2)$, $\delta = 1/(\lg e \cdot \lg p)$, and $e$ is the base of the natural logarithm, approximately 2.718. Note that if $p \to \infty$ then $\delta \to 0$ and $\kappa \to 1$.

U3 The total time is $O(\log^3 p + (\log p)(\log N))$, if we make the following two assumptions:

• It takes $O(1)$ time to send each piece of data.

• Each (leaf) processor maintains elements based on the priority queue described in Section 4.2, by letting the grain size of each element (corresponding to a task) be one.

The PRS algorithm that satisfies these properties is very efficient for the following reasons.
Concerning the first property, using only one combining operation and one disseminating operation is quite efficient because the number of sends/receives and the number of critical-path sends/receives are both optimal. This advantage is especially important on a network-based multicomputer, where each send/receive incurs significant overhead.

The second property shows that the packet size is moderate. Assume that each item requires 8 bytes. Then, for example, if $p$ is 1,000, the packet size is about 1 kilobyte; if $p$ is 1,000,000, the packet size is only about 4 kilobytes. In many networks, the time for sending a packet of 4 kilobytes is only a small number of times longer than that for a single word (see [27]). From the above two properties, we conclude that our PRS algorithm is very efficient in network-based multicomputers.
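
The packet sizes quoted above can be checked against the bound in property U2. The sketch below is only a numerical illustration: the function name is hypothetical, the 8 bytes per item is the assumption stated in the text, and floating-point arithmetic stands in for the exact bound.

```python
import math

def packet_item_bound(p):
    """Upper bound ceil(kappa * lg^2 p) + 1 on the items per packet (property U2), for p >= 2."""
    lg_p = math.log2(p)
    delta = 1.0 / (math.log2(math.e) * lg_p)   # delta = 1 / (lg e * lg p)
    kappa = 1.0 / (1.0 - delta / 2.0)          # kappa = 1 / (1 - delta / 2)
    return math.ceil(kappa * lg_p ** 2) + 1

BYTES_PER_ITEM = 8  # assumption taken from the text

for p in (1_000, 1_000_000):
    items = packet_item_bound(p)
    print(p, items, items * BYTES_PER_ITEM)
```

Under these assumptions the byte counts come out slightly below the rounded figures of 1 and 4 kilobytes.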

The third property shows that the time complexity for a parallel system with fast communication, like the CM5 [99], is also quite low.

In the remaining sections, we will design a PRS algorithm which satisfies the above three properties, U1-U3. This algorithm uses only one combining operation and one disseminating operation (property U1), in which each packet size satisfies property U2. The combining operation, described in Section 5.2.3, combines the key value distribution lists (defined in Section 5.2.2) of all the processors into the root processor. Note that the data of these lists roughly represent the key value distributions; these lists can be merged without too much loss of accuracy. In Section 5.2.3, we will also show that on the root processor we can select a key value threshold $\pi_{thr}$ from the final combined list (which roughly represents the key value distribution of the entire system), such that the total number of elements with priorities $\le \pi_{thr}$ is $\Theta(M)$. After the combining operation, we can simply disseminate $\pi_{thr}$ to all the processors to select all elements with priorities $\le \pi_{thr}$. Thus, this algorithm solves the PRS problem. In Section 5.2.4, we will also prove that the algorithm satisfies the third property U3. Finally, in Section 5.2.5, we will discuss the case in which the degree of the processor tree is a constant (integer) more than two, and also describe how the algorithm can be applied to our multilist system, as described in Section 4.1.2.2.

5.2.2 Key Value Distribution Lists


In our PRS algorithm, each processor $P_i$ needs to generate a list $\lambda_i$ which can roughly represent the key value distribution in the whole processor subtree rooted at $P_i$. We will call such a list a key value distribution list or a KVD list. A KVD list has several items and each item contains a pair of values: a key value and a load value. For analyzing a KVD list, we will use the following notation:

• $m_i$: denotes the number of items in the KVD list $\lambda_i$.

• $(\pi_{i,j}, L_{i,j})$: denotes the $j$-th item in the KVD list $\lambda_i$, where $1 \le j \le m_i$. Here, $\pi_{i,j}$ and $L_{i,j}$ are the key value and the load value of this item, respectively.

• $\Gamma_i$: denotes the processor subtree rooted at $P_i$.

• $n_i(\pi)$: denotes the number of elements with the key value $\pi$ in the processor subtree $\Gamma_i$.

• $N_i(\pi)$: denotes $\sum_{\pi' \le \pi} n_i(\pi')$, i.e., the number of elements in $\Gamma_i$ with key values at most $\pi$.

For simplicity, if the root processor is $P_{rt}$, we will use $n(\pi)$ and $N(\pi)$ to stand for $n_{rt}(\pi)$ and $N_{rt}(\pi)$, respectively.

From the above definition of $N_i(\pi)$, we will have the following basic properties:

• $N_i(\infty)$ is the total number of elements in the processor subtree $\Gamma_i$ while $N_i(-\infty)$ is zero, where $\infty$ ($-\infty$) is a value higher (lower) than any key value which can be used.

• $N_i(\pi_1) \le N_i(\pi_2)$ if $\pi_1 < \pi_2$. That is, the function $N_i(\pi)$ is monotonically increasing.

• If processor $P_i$ is an internal processor and processors $P_l$ and $P_r$ are the two children of $P_i$, then $N_i(\pi) = N_l(\pi) + N_r(\pi)$ for each $\pi$. This is due to the assumption that each internal processor contains no elements.
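
These properties can be illustrated with a small sketch in which the cumulative count is computed by binary search over sorted keys. The function name `N` mirrors the notation above; the leaf contents are hypothetical data, and the parent is modelled simply as the union of its children's elements, since an internal processor holds none of its own.

```python
from bisect import bisect_right

def N(sorted_keys, pi):
    """Cumulative count: number of keys <= pi in an ascending list."""
    return bisect_right(sorted_keys, pi)

left_leaf = [2, 5, 11]         # keys under the left child (hypothetical)
right_leaf = [3, 7, 8, 20]     # keys under the right child (hypothetical)
parent = sorted(left_leaf + right_leaf)

# Additivity: the parent's count equals the sum of its children's counts.
for pi in (1, 5, 9, 25):
    print(pi, N(parent, pi), N(left_leaf, pi) + N(right_leaf, pi))
```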


Definition 5.4 From above, a KVD list $\lambda_i$ (generated on processor $P_i$) is called a $k$-deviant KVD list, if the following two properties hold for the list:

V1 The key values are strictly increasing and the load values are monotonically increasing with the following restrictions.

1. For the final item $(\pi_{i,m_i}, L_{i,m_i})$, if $\pi_{i,m_i} = \infty$, then $L_{i,m_i} = N_i(\infty)$; otherwise, $M \le L_{i,m_i} \le N_i(\pi_{i,m_i})$.

2. All the non-final load values (i.e., all the load values except for the final one) are less than $M$.

V2 Let the list have a pseudo item $(\pi_{i,0}, L_{i,0})$, where $\pi_{i,0} = -\infty$ and $L_{i,0} = 0$. For each key $\pi$, where $\pi_{i,j} \le \pi < \pi_{i,j+1}$ and $0 \le j \le m_i - 1$, the value $N_i(\pi)$ is in the range $L_{i,j} \le N_i(\pi) \le k L_{i,j}$.


From the above definition, if $k < k'$, a $k$-deviant KVD list is obviously also a $k'$-deviant list.


Figure 5.9: (a) A key value distribution in the processor subtree $\Gamma_i$, showing a KVD list $\lambda_i$ (containing five items) which is 2-deviant, given $M = 9$. (b) Simplified diagram to show the possible range of $N_i(\pi)$, given the 2-deviant KVD list in (a); the ranges marked along the key axis are $[1, 2]$, $[2, 4]$, $[4, 8]$, $[6, 12]$, and $[10, \infty]$.

Figure 5.9(a) illustrates a key value distribution in the processor subtree $\Gamma_i$, showing a KVD list $\lambda_i$ (containing five items) which is 2-deviant, given $M = 9$.

An important feature of a $k$-deviant KVD list is that for each key $\pi < \pi_{i,m_i}$ (the final key value) we can find a load value $L$ from the KVD list, such that $L \le N_i(\pi) \le kL$. In Figure 5.9(a), the shadowed area indicates the possible range of $N_i(\pi)$ for each $\pi$. (Note that given a 2-deviant list the possible range of each $N_i(\pi)$ can also be depicted in a simpler way in Figure 5.9(b).) Thus, the list provides us with the rough KVD information.
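
To make Definition 5.4 concrete, the following sketch checks properties V1 and V2 for a given list against the actual keys of a subtree. It is only one reading of the definition: "monotonically increasing" loads are taken as non-decreasing, the list is assumed non-empty, and the function name and argument layout are hypothetical. Since $N_i(\pi)$ is a step function, V2 is checked only at the left end of each interval (smallest value) and just before the right end (largest value).

```python
from bisect import bisect_left, bisect_right
import math

def is_k_deviant(sorted_keys, kvd, M, k):
    """Check V1 and V2 for a non-empty KVD list of (key, load) pairs.
    sorted_keys: ascending distinct keys of the subtree (they define N_i)."""
    def N(pi):
        return bisect_right(sorted_keys, pi)

    keys = [pi for pi, _ in kvd]
    loads = [L for _, L in kvd]
    # V1: keys strictly increasing, loads non-decreasing.
    if any(a >= b for a, b in zip(keys, keys[1:])):
        return False
    if any(a > b for a, b in zip(loads, loads[1:])):
        return False
    # V1, restriction 1: the final item.
    last_key, last_load = kvd[-1]
    if last_key == math.inf:
        if last_load != N(math.inf):
            return False
    elif not (M <= last_load <= N(last_key)):
        return False
    # V1, restriction 2: non-final loads are below M.
    if any(L >= M for L in loads[:-1]):
        return False
    # V2: on each interval [pi_j, pi_{j+1}), L_j <= N(pi) <= k * L_j.
    items = [(-math.inf, 0)] + list(kvd)      # prepend the pseudo item
    for (lo, L), (hi, _) in zip(items, items[1:]):
        smallest = N(lo)                       # value of N at the left end
        largest = bisect_left(sorted_keys, hi) # number of keys strictly below hi
        if not (L <= smallest and largest <= k * L):
            return False
    return True
```

For example, with keys 1 to 20 and $M = 9$, the list with items (1, 1), (2, 2), (4, 4), (8, 8), (16, 16) passes for $k = 2$ and fails for $k = 1.5$.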

In the next section, we will create KVD lists containing an exponentially increasing series of load values. Thus, the number of items in a KVD list can be reduced to a very small number.


5.2.3 Combining

In this section, we will first design a combining operation which can find a key value threshold $\pi_{thr}$ satisfying the condition $N(\pi_{thr}) = \Theta(M)$, and in which each packet has at most $\lceil \kappa (\lg p)(\lg M) \rceil + 1$ items (where $\kappa$ is defined in Theorem 5.6). Then, we will further improve the operation such that each packet only needs at most $\lceil \kappa \lg^2 p \rceil + 1$ items (note that $M$ may be much larger than $p$), while we can still find a key value threshold satisfying $N(\pi_{thr}) = \Theta(M)$.


We summarize the combining operation in which each packet has at most $\lceil \kappa (\lg p)(\lg M) \rceil + 1$ items, as follows. Each leaf processor first creates a 2-deviant KVD list with at most $\lceil \lg M \rceil + 1$ items, as described in Section 5.2.3.1, and then sends the list to its parent. Then, for each internal processor node $P_i$, if its two children have generated $k$-deviant KVD lists and have sent their lists to $P_i$, then $P_i$ can, as shown in Section 5.2.3.2, merge their lists into a KVD list with $k(1 + \delta)$-deviation and with at most $\lceil \kappa (\lg p)(\lg M) \rceil + 1$ items. If processor $P_i$ is not the root, the list will be sent to its parent, which in turn repeats the above operation. If processor $P_i$ is the root $P_{rt}$, Lemma 5.7 below proves that the KVD list $\lambda_{rt}$ of the root processor is 4-deviant. Now, we can choose the final key value $\pi_{rt,m_{rt}}$ in $\lambda_{rt}$ as the key value threshold $\pi_{thr}$ because the condition $N(\pi_{rt,m_{rt}}) = \Theta(M)$ holds according to Lemma 5.8.

Lemma 5.7 For the combining operation described above, the KVD list on the
root processor is 4-deviant.

Proof. By Lemma 5.9 in Section 5.2.3.1, all the KVD lists on leaf processors (at level $q$) are 2-deviant. Then, the KVD lists on those internal processor nodes at level $q - 1$ are $2(1 + \delta)$-deviant, by Lemma 5.11 in Section 5.2.3.2. Whenever we go up one level, the deviation degree becomes $(1 + \delta)$ times larger. Thus, the list on the root processor will become $2(1 + \delta)^q$-deviant. Since³ $2(1 + \delta)^q < 4$, the list is also a 4-deviant list. []
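
The bound used in this proof can be spot-checked numerically. The sketch below is only an illustration for a few tree sizes, not a substitute for the analytic argument; the function name is hypothetical and floating-point arithmetic is assumed to be accurate enough for these values of $p$.

```python
import math

def root_deviation(p):
    """Deviation factor 2 * (1 + delta)**q at the root,
    with q = lg p and delta = 1 / (lg e * lg p)."""
    q = math.log2(p)
    delta = 1.0 / (math.log2(math.e) * q)
    return 2.0 * (1.0 + delta) ** q

for p in (2, 4, 1024, 2 ** 20):
    print(p, root_deviation(p))
```

The factor approaches 4 from below as $p$ grows, so 4 is the tightest constant this argument can give.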


Lemma 5.8 For the combining operation described above, $N(\pi_{rt,m_{rt}}) = \Theta(M)$, where $\pi_{rt,m_{rt}}$ is the final key value in the KVD list $\lambda_{rt}$ of the root processor $P_{rt}$.


Proof. We will prove that (1) $N(\pi_{rt,m_{rt}}) \ge M$ and (2) $N(\pi_{rt,m_{rt}}) \le 4M$.

(1) We will prove $N(\pi_{rt,m_{rt}}) \ge M$ from the first restriction of Property V1 in the KVD list $\lambda_{rt}$ (note that the list is 4-deviant from Lemma 5.7). If $\pi_{rt,m_{rt}} = \infty$, then $N_{rt}(\infty) \ge M$ because the total number of elements, $N(\infty)$, is at least $M$ from the definition of the PRS problem. If $\pi_{rt,m_{rt}}$ is not $\infty$, $M \le L_{rt,m_{rt}} \le N_{rt}(\pi_{rt,m_{rt}})$. Thus, $N(\pi_{rt,m_{rt}}) \ge M$ for both cases.

(2) Since $N(\pi_{rt,m_{rt}}) = N(\pi_{rt,m_{rt}} - 1) + n(\pi_{rt,m_{rt}})$, it suffices to prove that the following two conditions hold: (a) $n(\pi_{rt,m_{rt}}) \le 1$ and (b) $N(\pi_{rt,m_{rt}} - 1) < 4M$. Since all elements have distinct key values, each $n(\pi) \le 1$, i.e., condition (a) holds. Since the KVD list $\lambda_{rt}$ is 4-deviant, $N(\pi_{rt,m_{rt}} - 1) \le 4L_{rt,m_{rt}-1}$. From the second restriction of property V1, $L_{rt,m_{rt}-1} < M$. Thus, $N(\pi_{rt,m_{rt}} - 1) < 4M$, i.e., condition (b) holds. []

The above combining operation can find a key value threshold satisfying the condition $N(\pi_{thr}) = \Theta(M)$, with each packet requiring at most $\lceil \kappa (\lg p)(\lg M) \rceil + 1$ items. This upper bound is greater than $\lceil \kappa \lg^2 p \rceil + 1$ (in Property U2) when $M > p$. So, we will, in Section 5.2.3.3, reduce the number of items in each list, such that each packet has at most $\lceil \kappa \lg^2 p \rceil + 1$ items (satisfying Property U2) while retaining the property that the key value $\pi_{rt,m_{rt}}$ is a key value threshold satisfying the PRS problem.

³ The Maclaurin series (Taylor expansion around zero) for $\ln(1 + \delta)$ is $\delta - \delta^2/2 + \delta^3/3 - \cdots$. In addition, since $p \ge 2$, $1 > \delta > 0$. Hence, we can derive that $\delta > \ln(1 + \delta) > \delta - \delta^2/2$. Thus, $(1 + \delta)^q = 2^{q \lg(1 + \delta)} < 2^{q \delta \lg e} = 2$.


5.2.3.1 Leaf Processors

In this section, we will design the algorithm, called the Create Algorithm (below), that each leaf processor will use to create its KVD list. Then, we will prove in Lemma 5.9 that the created KVD list is 2-deviant, and in Lemma 5.10 that the list has at most $\lceil \lg M \rceil + 1$ items.

Create Algorithm

Step 1 Initially, let the variable $L = 1$ and the list $\lambda_i$ be empty.

Step 2 If $N_i(\infty) < L$, append a new item $(\infty, N_i(\infty))$ to the end of the list $\lambda_i$ and then stop.

Step 3 Find the $L$-th smallest key value $\pi$, and append a new item $(\pi, L)$ to the end of the list $\lambda_i$. Note that since all elements have distinct keys, $N_i(\pi) = L$.

Step 4 If $L \ge M$, stop; otherwise, let $L = 2L$ and repeat Step 2.

Figure 5.10: A key value distribution in the processor subtree $\Gamma_i$, showing the 2-deviant KVD list $\lambda_i$ (containing five items) which is generated by the Create Algorithm, given $M = 9$.

From the above algorithm, all the non-final items in this list should be generated at Step 3 and hence their load values are $1, 2, 4, \ldots, 2^{m_i - 2}$ $(< M)$. Figure 5.10 illustrates a key value distribution in the processor subtree $\Gamma_i$ rooted at processor $P_i$, showing the 2-deviant KVD list $\lambda_i$ (containing five items) which is generated by the Create Algorithm, given $M = 9$.
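
The four steps of the Create Algorithm can be sketched as follows. This is a minimal illustration under stated assumptions: a plain sort stands in for the priority queue of Section 4.2, keys are assumed distinct, Step 4 is read as stopping once $L \ge M$, and the function name is hypothetical.

```python
import math

def create_kvd_list(keys, M):
    """Build a KVD list of (key, load) pairs for one leaf, following Steps 1-4.
    keys: distinct key values held by the leaf; M: requested selection size."""
    sorted_keys = sorted(keys)   # stand-in for the leaf's priority queue
    total = len(sorted_keys)     # N_i(infinity)
    L = 1                        # Step 1
    kvd = []
    while True:
        if total < L:            # Step 2: the leaf holds fewer than L elements
            kvd.append((math.inf, total))
            return kvd
        kvd.append((sorted_keys[L - 1], L))   # Step 3: the L-th smallest key
        if L >= M:               # Step 4: stop once the load reaches M
            return kvd
        L *= 2                   # otherwise double L and go back to Step 2

# Hypothetical leaf holding keys 1..20, with M = 9.
print(create_kvd_list(range(1, 21), 9))
```

With 20 keys and $M = 9$ the sketch produces five items with loads 1, 2, 4, 8, and 16, which matches the bound $\lceil \lg M \rceil + 1 = 5$ of Lemma 5.10.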


Lemma  5.9  On  each  leafprocessor,  its  KVD  list  generated  by  the  Create  Algorithm 
is  2-deviant. 

Proof.  We  will  prove  that  on  each  leaf  processor  P,,  its  KVD  list  A,  generated  by 
the  Create  Algorithm  satisfies  Properties  VI  and  V2. 

V1 We will consider the non-final items first and then the final item. From the
Create Algorithm, all the non-final load values are $1, 2, \ldots, 2^{m_i-2}$ (all less than $M$),
so they are strictly increasing and the second restriction holds. As for all
non-final key values $\pi_{i,j}$, since $N_i(\pi_{i,j}) = L_{i,j}$ and non-final load values are
strictly increasing, we can derive that these non-final key values are also
strictly increasing.

Now, let us consider the final item $(\pi_{i,m_i}, L_{i,m_i})$. If the algorithm stops at
Step 2, then $\pi_{i,m_i} = \infty$ and $L_{i,m_i} = N_i(\infty)$. Otherwise, the algorithm stops
at Step 4, right after the final item is appended at Step 3. In this case,
$\pi_{i,m_i} \ne \infty$ and $L_{i,m_i} = N_i(\pi_{i,m_i}) \ge M$. Thus, the
first restriction holds. In addition, since the last load value is either $N_i(\infty)$
(greater than or equal to each $N_i(\pi)$ and each non-final load value, which is
some $N_i(\pi)$) or at least $M$, the value is no less than any non-final load value.
Thus, all the load values in the list are monotonically increasing. Since the
last key value is either $\infty$ or a key value $\pi$ such that $N_i(\pi) \ge M$, we can
derive that the key value is greater than any non-final key value. Thus, all the
key values in the list are strictly increasing.
V2 Consider any two consecutive items, the $j$th item and the $(j+1)$-st item,
where $0 < j \le m_i - 1$. Since $L = N_i(\pi)$ for each item $(\pi, L)$ generated
by the Create Algorithm, we can derive the following property: for each key
value $\pi$, where $\pi_{i,j} \le \pi < \pi_{i,j+1}$, $L_{i,j} = N_i(\pi_{i,j}) \le N_i(\pi) \le N_i(\pi_{i,j+1}) = L_{i,j+1}$.
In order to prove that Property V2 holds, we will only prove that
$L_{i,j+1} \le 2L_{i,j}$, as follows. If the $(j+1)$-st item is generated at Step 2,
then $L_{i,j+1} = N_i(\infty) < 2L_{i,j}$; otherwise, the item is generated at Step 3 and


Lemma 5.10 On each leaf processor, its KVD list obtained by the Create Algorithm
has at most $\lceil \lg M \rceil + 1$ items.

Proof. Since the non-final load values are $1, 2, \ldots, 2^{m_i-2}$ (all less than $M$), we obtain
$m_i - 2 < \lg M$, i.e., $m_i \le \lceil \lg M \rceil + 1$. Thus, the lemma holds. □


5.2.3.2 Internal Processor Nodes

In this section, we will design an algorithm that each internal processor node $P_i$ uses to merge
the KVD lists from both of its children $P_l$ and $P_r$ into its KVD list $\lambda_i$. We will also prove that
the new list is $k(1+\delta)$-deviant if both KVD lists from its children are $k$-deviant, and that the
new list has at most $\lceil \kappa(\lg p)(\lg M) \rceil + 1$ items. But, before investigating the algorithm, we
will first present a simpler merge algorithm as follows.


Simple Merge Algorithm: Let the symbols $(\pi_l, L_l)$ and $(\pi_r, L_r)$ represent the first items in
$\lambda_l$ and $\lambda_r$ respectively; and let the symbols $(\pi'_l, L'_l)$ and $(\pi'_r, L'_r)$ represent the previously deleted
items in $\lambda_l$ and $\lambda_r$ respectively.

Step 1 Let both $(\pi'_l, L'_l)$ and $(\pi'_r, L'_r)$ be $(-\infty, N_i(-\infty) = 0)$.

Step 2 If $\pi_l = \pi_r$, append a new item $(\pi_l, L_l + L_r)$ to the end of $\lambda_i$, remove both first items from
$\lambda_l$ and $\lambda_r$, and go to Step 4. Note that if a new item is appended then the values of $\pi_l$, $L_l$,
$\pi_r$, $L_r$, $\pi'_l$, $L'_l$, $\pi'_r$, and $L'_r$ are changed implicitly.

Step 3 If $\pi_l < \pi_r$, append a new item $(\pi_l, L_l + L'_r)$ to the end of $\lambda_i$ and remove the first item of
$\lambda_l$; otherwise, append a new item $(\pi_r, L_r + L'_l)$ to the end of $\lambda_i$ and remove the first item
of $\lambda_r$. Note that after this step the values of $\pi_l$, $L_l$, $\pi_r$, $L_r$, $\pi'_l$, $L'_l$, $\pi'_r$, and $L'_r$ are changed
implicitly.

Step 4 In the newly appended item $(\pi, L)$, if $L \ge M$ or $\pi = \infty$, stop; otherwise, repeat Step 2.

Figure 5.11: An example of the merge operation from $\lambda_l$ and $\lambda_r$ into $\lambda_i$, given $M = 9$.


Figure 5.11 illustrates an example of the merge operation. We also need to point out that
this merge algorithm will never repeat Step 2 (from Step 4) when either of the KVD
lists $\lambda_l$ or $\lambda_r$ is empty. Assume that the final key value of one list, say $\lambda_l$, is merged into $\lambda_i$.
From the first restriction of Property V1, either the final key value of $\lambda_l$ is $\infty$ or the final load
value of $\lambda_l$ is at least $M$. For the former case, the appended key value of $\lambda_i$ is also $\infty$ and
therefore the appended item is the final item (see Step 4) in the new list; for the latter case, the
appended load value is also at least $M$ and therefore the appended item is also the final item
(also see Step 4). From the above, the last key value of $\lambda_i$ must be no greater than either of the
last key values of $\lambda_l$ and $\lambda_r$.
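
The Simple Merge Algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the thesis code: the name `simple_merge` and the representation of a KVD list as a Python list of (key value, load value) pairs are hypothetical, and the loop guard on list exhaustion is purely defensive, since the argument above shows that the algorithm stops before either input list runs out.

```python
import math

INF = math.inf  # stands for the key value "infinity"


def simple_merge(left, right, M):
    """Sketch of the Simple Merge Algorithm for two children's KVD lists."""
    i = j = 0
    prev_l = prev_r = 0      # Step 1: loads of the previously deleted items
    merged = []
    while i < len(left) and j < len(right):
        (key_l, load_l), (key_r, load_r) = left[i], right[j]
        if key_l == key_r:                      # Step 2: equal key values
            item = (key_l, load_l + load_r)
            prev_l, prev_r = load_l, load_r
            i, j = i + 1, j + 1
        elif key_l < key_r:                     # Step 3: take from the left list
            item = (key_l, load_l + prev_r)
            prev_l = load_l
            i += 1
        else:                                   # Step 3: take from the right list
            item = (key_r, load_r + prev_l)
            prev_r = load_r
            j += 1
        merged.append(item)
        if item[1] >= M or item[0] == INF:      # Step 4: final item reached
            break
    return merged


left = [(1, 1), (3, 2), (7, 4), (15, 8), (31, 16)]
right = [(2, 1), (4, 2), (8, 4), (16, 8), (INF, 10)]
print(simple_merge(left, right, 9))
```

In this made-up example the merge stops as soon as the appended load value reaches $M = 9$, so the later items of both input lists are never consumed.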

Lemma 5.11 (below) proves that if the deviation degrees of the lists from its children are
the same, the parent's list has the same deviation degree. Since the lists of all leaf processors
are all 2-deviant, the list on the root will still be 2-deviant.

Lemma 5.11 For the above simple merge algorithm, if both $\lambda_l$ and $\lambda_r$ are $k$-deviant,
the new list $\lambda_i$ is also $k$-deviant.


Proof. We will examine Properties V1 and V2 in the new list $\lambda_i$.

V1 The above simple merge algorithm merges items of both lists $\lambda_l$ and $\lambda_r$ according
to their key values. Step 2 merges two items with the same key value into
one item, while Step 3 keeps the key values of the new list $\lambda_i$ in increasing
order. Thus, in the new list, the key values are strictly increasing. As for the
load values in the list $\lambda_i$, they are monotonically increasing for the following
reason. For each item in $\lambda_i$, its load value must be one of $L_l + L_r$ (Step 2),
$L'_l + L_r$, and $L_l + L'_r$ (Step 3), while the load value of its previous item (if any) is
$L'_l + L'_r$. Since the load values in both $\lambda_l$ and $\lambda_r$ are monotonically increasing,
$L_l \ge L'_l$ and $L_r \ge L'_r$. Hence, the load value of an item must be no less than
that of its previous item. Therefore, the load values in $\lambda_i$ are monotonically
increasing too. In addition, we will prove that the two restrictions in Property
V1 hold, as follows.

1. For the final item $(\pi_{i,m_i}, L_{i,m_i})$, suppose that $\pi_{i,m_i} = \infty$. The final items
of $\lambda_l$ and $\lambda_r$ also have key values $\infty$. Hence, the final load values of $\lambda_l$
and $\lambda_r$ are $N_l(\infty)$ and $N_r(\infty)$, respectively. Thus, from Step 2, the final load value
of $\lambda_i$ is $N_l(\infty) + N_r(\infty) = N_i(\infty)$, i.e., the first restriction holds in this
case.

Suppose that $\pi_{i,m_i} \ne \infty$. Then, from Step 4, the final load value $L_{i,m_i}$
must be at least $M$. So, for the first restriction, we will only need to prove
that the following condition C holds: […] $\le N_i(\pi_{i,m_i})$ […]. If $\pi_{i,m_i}$ is less
than both $\pi_{l,m_l}$ and $\pi_{r,m_r}$, we can derive the condition C from Property
V2 (which will be shown below). If $\pi_{l,m_l}$ or $\pi_{r,m_r}$, say the former, is
$\pi_{i,m_i}$ (note that $\pi_{i,m_i}$ cannot be greater than either $\pi_{l,m_l}$ or $\pi_{r,m_r}$, as shown
earlier), then […] from the first restriction of Property V1 in
the list $\lambda_l$. Hence, we can derive that the condition C holds, by applying
the technique used in the proof for Property V2 (below). Therefore, this
restriction holds.
2. All the non-final load values in the list $\lambda_i$ are less than $M$ because if one
of them were at least $M$ the item containing it would be the final item
(see Step 4).

Figure 5.12: An example of examining Property V2.

V2 In the above merge algorithm, consider the moment immediately before we
append the $j$th item to $\lambda_i$, where $1 < j \le m_i$. The key value of the $(j-1)$-st
item is $\pi_{i,j-1} = \max(\pi'_l, \pi'_r)$ and the key value of the $j$th item will be
$\pi_{i,j} = \min(\pi_l, \pi_r)$. Figure 5.12 illustrates an example. Thus, for all key values $\pi$,
where $\pi_{i,j-1} \le \pi < \pi_{i,j}$, the condition $L'_l \le N_l(\pi) \le kL'_l$ holds because $\lambda_l$
is $k$-deviant; the condition $L'_r \le N_r(\pi) \le kL'_r$ holds because $\lambda_r$ is $k$-deviant.
Since $N_i(\pi) = N_l(\pi) + N_r(\pi)$, the condition $L'_l + L'_r \le N_i(\pi) \le kL'_l + kL'_r$
holds. Since the $(j-1)$-st load value is $L'_l + L'_r$, Property V2 holds. □

Although this algorithm can merge KVD lists while keeping the same deviation degree,
the number of items in the new list may double after each merge operation. Hence, the lists of
the root processor and those internal processor nodes near the root may have $O(p \lg M)$ items.
Thus, the performance will degrade seriously, given a large $p$.

In order to solve this problem, we will modify the above simple merge algorithm by
removing those items (excluding the first and final items) whose load values are too "close"
to the previous load values. More precisely, at Step 4, we will add an operation immediately
before "repeat Step 2" as follows: delete the newly appended item $(\pi, L)$ if $L < (1+\delta)L'$,
where $L'$ is the load value of the previous item. Lemma 5.13 (below) will prove that the number
of items can be greatly reduced to at most $\lceil \kappa(\lg p)(\lg M) \rceil + 1$. The price that we have to pay
for reducing the number of items is that the list will have a higher deviation degree. Lemma
5.12 proves that the deviation degree becomes only $(1+\delta)$ times higher. Since $\delta$ is a very
small number ($\delta = 1/(2 \lg e \lg p)$), we can still keep the deviation degree of the KVD list $\lambda_{rt}$
on the root processor constant (as shown in Lemma 5.7).
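
The modification can be sketched in Python by adding a single deletion test to a merge loop. As before, this is an illustrative sketch under assumptions: `modified_merge` is a hypothetical name, KVD lists are modeled as Python lists of (key value, load value) pairs, and the "previous item" is taken to be the last item still kept in the new list. The value `delta = 0.5` in the usage example is chosen only to make deletions visible; it is not the thesis's $\delta$.

```python
import math

INF = math.inf  # stands for the key value "infinity"


def modified_merge(left, right, M, delta):
    """Sketch of the modified merge: simple merge plus deletion of 'close' items."""
    i = j = 0
    prev_l = prev_r = 0      # loads of the previously deleted input items
    merged = []
    while i < len(left) and j < len(right):
        (key_l, load_l), (key_r, load_r) = left[i], right[j]
        if key_l == key_r:
            item = (key_l, load_l + load_r)
            prev_l, prev_r = load_l, load_r
            i, j = i + 1, j + 1
        elif key_l < key_r:
            item = (key_l, load_l + prev_r)
            prev_l = load_l
            i += 1
        else:
            item = (key_r, load_r + prev_l)
            prev_r = load_r
            j += 1
        merged.append(item)
        if item[1] >= M or item[0] == INF:
            break            # the final item is never deleted
        # Added operation: delete the new item if its load is within a factor
        # (1 + delta) of the previous kept load; the first item is always kept.
        if len(merged) > 1 and item[1] < (1 + delta) * merged[-2][1]:
            merged.pop()
    return merged


left = [(1, 1), (3, 2), (7, 4), (15, 8), (31, 16)]
right = [(2, 1), (4, 2), (8, 4), (16, 8), (INF, 10)]
print(modified_merge(left, right, 9, 0.5))
```

Setting `delta` to 0 disables the deletion test, so the sketch then behaves like the simple merge.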


Lemma 5.12 In the above modified merge algorithm, if both $\lambda_l$ and $\lambda_r$ are $k$-deviant,
the new list $\lambda_i$ is $k(1+\delta)$-deviant.


Proof. We will prove that for the new list $\lambda_i$ Properties V1 and V2 hold.

V1 For the modified merge algorithm, we still keep the first and final items, and
may delete some items between them. Since the ordering of the remaining
items is not changed and the final item is not deleted, we can derive that
the new KVD list still satisfies Property V1.

Figure 5.13: An example of removing items.

V2 In the KVD list $\lambda_i$, consider any two consecutive items, say the $j$th item and
the $(j+1)$-st item. Let the item $(\pi, L)$ be the last removed item between
the $j$th and the $(j+1)$-st items, as shown in Figure 5.13. In the modified
merge algorithm, we remove the item only if $L < (1+\delta)L_{i,j}$. Thus, for each
key value $\pi$, $\pi_{i,j} \le \pi < \pi_{i,j+1}$, $L_{i,j} \le N_i(\pi) \le kL < k(1+\delta)L_{i,j}$. Thus,
Property V2 holds. □
Lemma 5.13 For an internal processor node $P_i$, if its KVD list $\lambda_i$ is merged by
using the above modified merge algorithm, the number of items in $\lambda_i$ is at most
$\lceil \kappa(\lg p)(\lg M) \rceil + 1$.


Proof. For this proof, it suffices to prove that the number of non-final items is
$n \le \lceil \kappa(\lg p)(\lg M) \rceil$. For every two consecutive non-final load values $L'$ and
$L$ (the next load value of $L'$) in $\lambda_i$, the condition $(1+\delta)L' \le L$ holds from the
modified merge algorithm. Hence, the last one among all the non-final load values
must be at least $(1+\delta)^{n-1}$ (note that the first load value must be at least one). Since
each non-final load value is less than $M$ from the second restriction of Property
V1, we can derive that $(1+\delta)^{n-1} < M$ or $n - 1 < (\lg M)/\lg(1+\delta)$. From this
result, it can be verified* that $n \le \lceil \kappa(\lg p)(\lg M) \rceil$. □


5.2.3.3 Improved Combining Operation

For the above combining operation described in the previous two sections, each packet has at
most $\lceil \kappa(\lg p)(\lg M) \rceil + 1$ items. When $M > p$, this upper bound on the number of items is higher
than the bound $\lceil \kappa \lg^2 p \rceil + 1$ of Property U2. So, in this section, we want to reduce the upper bound to
$\lceil \kappa \lg^2 p \rceil + 1$, while letting the final key value in the list of the root processor be a key value
threshold.

In the improved combining operation, we only need to modify the original Create Algorithm
by removing those non-final items whose load values are less than $M/p$, as illustrated in Figure
5.14(a) with $M = 9$ and $p = 4$. Hence, the first non-final item (if any) has a load value
$L$ with $M/p \le L < 2M/p$. Thus, the number of non-final items generated by the new Create
Algorithm will become at most $\lceil \lg p \rceil$ (i.e., the total number of items is at most $\lceil \lg p \rceil + 1$),
because we can double at most $\lceil \lg p \rceil - 1$ times between $M/p$ (inclusive) and $M$ (exclusive).
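
A minimal Python sketch of the new Create Algorithm is given below. It is not the thesis implementation: the name `improved_create_kvd_list` is hypothetical, the leaf's keys are assumed to be given as a plain list, Step 1 of the original Create Algorithm is assumed to set $L = 1$, and the sketch only filters the list (it does not perform the key value adjustment of Figure 5.14(b)).

```python
import math

INF = math.inf  # stands for the key value "infinity"


def improved_create_kvd_list(keys, M, p):
    """Original Create Algorithm followed by removal of small non-final items."""
    keys = sorted(keys)
    total = len(keys)
    L = 1                        # ASSUMED Step 1 of the original algorithm
    kvd = []
    while True:
        if total < L:            # Step 2
            kvd.append((INF, total))
            break
        kvd.append((keys[L - 1], L))   # Step 3
        if L >= M:               # Step 4
            break
        L *= 2
    final = kvd[-1]
    # Keep only the non-final items whose load value is at least M/p.
    kept = [item for item in kvd[:-1] if item[1] >= M / p]
    return kept + [final]


print(improved_create_kvd_list(range(1, 21), 9, 4))
```

For $M = 9$ and $p = 4$ the threshold is $M/p = 2.25$, so the items with load values 1 and 2 are dropped and three items remain, which agrees with the bound $\lceil \lg p \rceil + 1 = 3$.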

On the internal processor nodes, we still use the modified merge algorithm (described in
the previous section) to merge the KVD lists. It can be easily verified from the modified merge
algorithm and the new Create Algorithm that the first non-final load value (if any) of a KVD
list on an internal processor node is still at least $M/p$. Then, we can derive in Lemma 5.14 that
the number of items in $\lambda_i$ is at most $\lceil \kappa \lg^2 p \rceil + 1$.

* As shown earlier, $\log(1+\delta) > \delta - \delta^2/2 = \delta(1 - \delta/2) = \delta/\kappa$. Hence, $(\lg M)/\lg(1+\delta) <
(\lg M)/((\lg e)(\delta/\kappa)) < \kappa(\lg p)(\lg M)$.

Figure 5.14: (a) Removing those items with priorities lower than $M/p$ (= 9/4). (b) Increasing
the key values less than $\pi_{i,1}$ to $\pi_{i,1}$.


Lemma 5.14 For the above improved combining operation, if a KVD list $\lambda_i$ on an
internal processor $P_i$ is generated by the modified merge algorithm, the number of
items in $\lambda_i$ is at most $\lceil \kappa \lg^2 p \rceil + 1$.

Proof. Since the first non-final load value (if any) is at least $M/p$ and each non-final
load value is less than $M$, we can derive that the number of items in $\lambda_i$ is at most
$\lceil \kappa \lg^2 p \rceil + 1$, by applying the same technique used in the proof of Lemma
5.13. □

Although the number of items in each packet is reduced to what we want, we should note
that after the modification each KVD list generated by the new Create Algorithm may not be
2-deviant. This is because on the corresponding leaf processor there may be some elements
with key values lower than the first key value of the list. In order to let each leaf processor's
KVD list $\lambda_i$ become 2-deviant again, we increase those key values less than $\pi_{i,1}$ (in the processor
subtree $\Gamma_i$) to $\pi_{i,1}$, as illustrated in Figure 5.14(b). After increasing the key values, each $n(\pi)$
becomes at most $2M/p$.

Since the internal processor nodes still use the modified merge algorithm to generate
their KVD lists, Lemma 5.11 still holds and then the KVD list in the root processor is still
4-deviant from Lemma 5.7. Although $n(\pi) \le 2M/p$ for each $\pi$, we still can derive that
$N(\pi_{rt,m_{rt}}) = \Theta(M)$, by applying the same technique used in the proof of Lemma 5.8.

The above result only shows that after increasing key values, the value $N(\pi_{rt,m_{rt}})$ is $\Theta(M)$.
Lemma 5.15 proves that if we increase the key values of at most $2M/p$ elements on each of the $p$ leaf
processors and the value $N(\pi)$ is $\Theta(M)$ after increasing these key values, the value $N(\pi)$ is also $\Theta(M)$
before the increase. This implies that the condition $N(\pi_{rt,m_{rt}}) = \Theta(M)$ also holds before
the increase; that is, the final key value $\pi_{rt,m_{rt}}$ is a key value threshold satisfying the PRS
problem.
Lemma 5.15 For the PRS problem, suppose that we increase key values of at
most $2M/p$ elements on each of the $p$ leaf processors. If $N(\pi) = \Theta(M)$ after
we increase these key values, then the original $N(\pi)$ (before increasing these key
values) is also $\Theta(M)$.

Proof. Let $N_{before}(\pi)$ denote the original value $N(\pi)$ and $N_{after}(\pi)$ denote the
value $N(\pi)$ after we increase some key values. Since we increase key values
of at most $p(2M/p)$ elements, the value $N(\pi)$ decreases by at most $2M$ in total.
Hence, $N_{before}(\pi) - 2M \le N_{after}(\pi) \le N_{before}(\pi)$; that is, $N_{after}(\pi) \le N_{before}(\pi) \le
N_{after}(\pi) + 2M$. Therefore, if $N_{after}(\pi) = \Theta(M)$, then $N_{before}(\pi) = \Theta(M)$. □


5.2.4  Time  Complexity 

In this section, we will analyze the time complexity of the PRS algorithm based on the two
assumptions: (1) it takes $O(1)$ time to send each piece of data; (2) each (leaf) processor
maintains elements based on the priority queue described in Section 4.2.1 by letting the grain
size of each element (corresponding to a task) be one.

The KVD lists on leaf processors are generated by the Create Algorithm. In this algorithm,
Step 3 can use the Threshpri operation (an operation on the priority queue described in Section
4.2.1), so the computation time at Step 3 is $O(\log N_i)$, where $N_i$ is the total number of elements
on processor $P_i$. Note that each other step only takes $O(1)$ time. Since $N_i \le N$, the time
complexity is $O(\log N)$ too. Since there are at most $\lceil \lg p \rceil + 1$ items in the list on a leaf
processor, we will repeat each step at most $\lceil \lg p \rceil + 1$ times. So, the total time on each leaf
processor is $O((\log p)(\log N))$.

The KVD lists on internal processors are generated by the modified merge algorithm. Since
the algorithm does a merge operation, the time depends on the number of items. The number of
items in each list is at most $\lceil \kappa \lg^2 p \rceil + 1$ (or $O(\log^2 p)$), so the total time taken for each internal
processor is only $O(\log^2 p)$. Since there are $\lg p$ levels in the processor tree and processors
at the same level can process the merge operation in parallel, the total time of the combining
operation is $O(\log^3 p + (\log p)(\log N))$.

As for the disseminating operation, we broadcast the value of $\pi_{thr}$ from the root processor
to each leaf processor. Obviously, the time complexity is smaller than that of the combining
operation.

After each leaf processor receives the value $\pi_{thr}$, the processor will use the Split operation
(an operation of the priority queue described in Section 4.2.1) to partition the priority queue into
two parts, one containing tasks with priorities $\pi \ge \pi_{thr}$ and the other containing the remaining
tasks. The time complexity for the Split operation is only $O(\log n)$.

From the above, we conclude that the total time complexity is $O(\log^3 p + (\log p)(\log N))$.


5.2.5 Discussion

In this section, we will first show that when the degree of the processor tree is a constant
(integer) more than two, we still can solve the PRS problem by using one combining and
disseminating operation in which each packet size is $O(\log^2 p)$. Then, we will discuss how to
apply the PRS algorithm to our multilist scheduling system.

Suppose that an internal processor node $P_i$ has three children. Processor $P_i$ will do the
3-way merge operation after receiving three KVD lists from its children. Using the technique
of the modified merge algorithm, the number of items in the KVD list $\lambda_i$ is still $O(\log^2 p)$ and
the deviation degree of $\lambda_i$ still increases by a factor of $(1+\delta)$. Since the height of the processor
tree is lower than $\lg p$, the KVD list in the root processor is still 4-deviant. Thus, the final key
value of this list is still a key value threshold $\pi_{thr}$ satisfying $N(\pi_{thr}) = \Theta(M)$. This can be
generalized to any constant degree processor tree, so we still can solve the PRS problem by
using packets with size $O(\log^2 p)$.

For the multilist scheduling model, the problem in Section 4.1.2.2 is as follows. Given a
load threshold $L_{thr}$ and a set of tasks each containing a priority and a grain size (represented by
an integer), select a set of highest-priority tasks whose total load (summation of grain sizes) is
$L_{sel} = \Theta(L_{thr})$, if each task's grain size is $O(L_{thr})$ and the total load of tasks is $L_{total} = \Omega(L_{thr})$.
If the maximum task grain size $G_{max}$ is larger than $L_{thr}$, then $L_{sel} = O(G_{max})$ because we may
need to group the task with $G_{max}$ (e.g., when the task with the grain size $G_{max}$ has the highest
priority). If the total load of tasks is $L_{total} < L_{thr}$, then $L_{sel} = L_{total}$ because we can at most
group tasks with total load $L_{total}$.
Now, we want to translate the above problem to the PRS problem in order to apply the PRS
algorithm presented in this section. Since each task $T$ may have a different grain size $G_T$, we
can conceptually break a task into $G_T$ elements with the same priority. Then, for each element
which is the $j$-th element with priority $\pi$ on processor $P_i$, we can define its key value as the
compound key $(-\pi, i, j)$, such that all key values are distinct. Note that by negating $\pi$, we
translate highest priority to lowest key value. We should note that when we break a task $T$
into $G_T$ elements, we let these elements be the $j$th through the $(j + G_T - 1)$-st elements with
the same priority as $T$ on the same processor, so that the elements for the same task will have
consecutive key values. Thus, there is at most one task whose elements are partially selected.
Since in the actual implementation we still need to select the whole task $T$, the total load of
selected tasks may be up to $G_{max}$ higher than the total number of selected elements. If $G_{max} \le L_{thr}$,
the total load of selected tasks is still $\Theta(L_{thr})$.
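
The translation into compound keys can be illustrated with a short Python sketch. The function name `task_elements` and the representation of a task as a (priority, grain size) pair are assumptions made for this example only; the thesis breaks tasks into elements conceptually and need not materialize them.

```python
from collections import defaultdict


def task_elements(proc_id, tasks):
    """Return the compound keys (-priority, proc_id, j) for all task elements.

    tasks: list of (priority, grain_size) pairs on processor proc_id
           (hypothetical representation).
    """
    next_index = defaultdict(lambda: 1)   # next free j for each priority
    keys = []
    for priority, grain in tasks:
        start = next_index[priority]
        # The grain elements of one task get consecutive indices j.
        for j in range(start, start + grain):
            keys.append((-priority, proc_id, j))
        next_index[priority] = start + grain
    return keys


keys = task_elements(0, [(5, 2), (5, 1), (3, 2)])
print(sorted(keys))
```

Because Python compares tuples lexicographically, sorting these keys in increasing order lists the elements of the highest-priority tasks first, and the elements of any single task stay adjacent.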

In addition, since the system does not know $L_{total}$ in advance, we cannot guarantee that
$L_{total} \ge L_{thr}$. In the case that $L_{total} < L_{thr}$, the final item of the KVD list in the root of
the processor tree must be $(\infty, N(\infty) = L_{total})$ from the first restriction of Property V1. In
this case, we can simply broadcast the final key value ($\infty$) to select all the tasks such that
$L_{sel} = L_{total}$.
Chapter 6


Experimental  Results 


In  this  chapter,  we  will  describe  our  experiments  with  the  multilist  scheduling  model  and  show 
good  performance  results  for  the  PBFS-GPQ  and  PDC-WK  scheduling  algorithms.  (Since  for 
parallel  best-first  search  (BFS)  we  only  choose  PBFS-GPQ  and  for  parallel  divide-and-conquer 
(D&C)  we  only  choose  PDC-WK,  we  will  respectively  abbreviate  their  names  as  PBFS  and 
PDC  in  this  chapter  for  simplicity.)  In  Section  6.1,  we  will  describe  the  environment  in  which 
we  have  run  our  experiments.  Then,  we  will  present  two  application  examples,  the  Fibonacci 
problem  and  the  set  covering  problem,  each  of  which  can  use  either  the  PBFS  or  the  PDC 
scheduling  algorithm,  and  we  will  report  our  experimental  results  for  these  applications.  The 
two  application  examples  will  be  described  in  Sections  6.2  and  6.3,  respectively. 


6.1  Environment 


Our multilist scheduling model is currently implemented on Nectar [6], a high-bandwidth and
low-latency computer network developed at Carnegie Mellon University. The Nectar system
consists of a Nectar network and a set of network coprocessors, called communication accelerator
boards (CABs), as illustrated in Figure 6.1. We can connect hosts (e.g., workstations)
to Nectar by attaching them to CABs via VME buses. Each CAB is a Sparc-based network
coprocessor with 1.5 Mbytes of local memory. Since each CAB has local memory, we can
install some processes on a CAB. The Nectar network consists of 100 Mbits/s fiber-optic links
plus 16 × 16 crossbar switches called HUBs. The CAB is connected to a HUB via a fiber
link. In our experiments, we use up to 8 Sun4/330s. The number of processors is not large,
but is sufficient to validate our multilist scheduling approach. In order to make the results
directly comparable, we run our experiments when no other users are using these machines.
As for software, Nectarine [94] is an interface package that provides efficient communication
primitives for programmers to access the Nectar runtime system. We have implemented our
system using Nectarine.

Figure 6.1: The Nectar system.

Basically, the implementation of the multilist scheduling model follows the description
presented in Chapter 4. For maintaining physical lists (PLs), we choose 2-4 trees instead of 2-3
trees simply because 2-4 trees may require slightly fewer operations of reconfiguring the trees.
Note that for all integer constants $a \ge 2$ and $b > a$ the heights of all a-b trees are still $O(\log n)$
[3], where $n$ is the number of distinct priority keys; thus, properties P1 and P2 in Section 4.2
still hold for all such a-b trees. In addition, since derived PLs have not been implemented, we



Figure  6.2:  GLB  trees  for  (a)  one,  (b)  two,  (c)  four,  and  (d)  eight  processors. 


For  maintaining  virtual  lists,  we  use  the  advanced  global  protocol  for  global  scheduling  and 
the  standard  protocol  for  other  situations.  In  the  global  protocol,  it  appears  that  the  degree  of 
the  global  load  balancer  (GLB)  tree  should  be  moderately  large  because  for  a  tree  with  small 
degree  (say  2)  we  need  more  GLB  processes  each  of  which  will  incur  some  overhead.  In  [90], 
Sinha  and  Kale  also  implemented  similar  global  scheduling  by  letting  one  load  balancer  handle 
at  least  8  processors  or  8  load  balancers,  i.e.,  they  used  a  load  balancer  tree  with  a  degree  at 
least  8.  But,  in  our  actual  implementation,  since  we  will  use  at  most  8  processors,  we  let  the 
tree  degree  be  4  so  that  we  can  conduct  experiments  for  a  GLB  tree  with  more  than  one  level. 
Figure 6.2 shows GLB trees with $p$ = 1, 2, 4, and 8, where the tree for 8 processors requires a
two-level GLB tree and the others require only a one-level GLB tree.


The GLB processes can be installed either on host processors or on CABs. Since most
parallel systems do not use network coprocessors (such as CABs), it is natural for us to focus
on the case of installing GLB processes on host processors. However, we will also examine
the case of installing them on CABs in order to see the benefit of using network coprocessors.

For the analysis of the experimental results in the following sections, we define speedup and
efficiency in Definition 6.1 below. The time taken to run the parallel version of a job on one
processor is generally higher than that for a sequential program because the parallel version also
includes some scheduling, communication, and GLB overhead. Therefore, it makes more sense
to use the timing results for the sequential version as a baseline.


Definition 6.1 Let $T_{seq}$ be the execution time for the original sequential version
of a program. Also, let $T_p$ be the time for the parallel version of the program
on $p$ processors. We define the SPEEDUP as $S_p = T_{seq}/T_p$ and the EFFICIENCY as
$E_p = S_p/p$. If we use $T_1$ instead of $T_{seq}$, we call these measures SIMPLE SPEEDUP,
$S' = T_1/T_p$, and SIMPLE EFFICIENCY, $E' = S'/p$.


6.2  Fibonacci 


In  this  section  we  will  present  a  simple  Fibonacci  problem,  which  will  allow  us  to  derive  the 
average  overhead  for  one  task  and  to  investigate  the  case  in  which  tasks  have  a  minimum 
number  of  operations. 
The Fibonacci problem solves the function $F(n)$ recursively as follows.

$$F(n) = \begin{cases} 1 & \text{if } n = 0 \text{ or } n = 1 \\ F(n-1) + F(n-2) & \text{otherwise} \end{cases}$$

We can recursively expand this computation as a tree, as illustrated in Figure 6.3 with $n = 3$.
This can be considered as a D&C algorithm. The program can be made parallel by letting each
node in the computation tree represent a thread. After the node for $F(n)$ creates two children
for $F(n-1)$ and $F(n-2)$, the node is suspended. The node becomes executable again only
after receiving two returned values $F(n-1)$ and $F(n-2)$ from its two children.

We will use PDC as the scheduling algorithm for this parallel program. As described in
Section 3.1.2, each node initially has local priority $l$ and global priority […], where
$l$ is the level of the node in the tree. When the node resumes from suspension (i.e., when it
receives two returned values from its two children), the global priority $\pi_G$ of the node becomes a very low number,
say $-\infty$. The reason is that since the grain size of the node at this moment becomes very small
(see the grain size definition in Section 4.1.2.3), we prefer not to move the node to another
processor.


# processors (p)            1       2       4       8
Total Time (T)          53.88   27.61   13.54    6.84
Simple Speedup (S')      1.00    1.95    3.97    7.86
Simple Efficiency (E')   1.00    0.98    0.99    0.98

Table 6.1: The total times (in seconds) and simple speedups/efficiencies for parallel Fibonacci.

In our experiment, we computed $F(26)$ with the GLB installed on a CAB so that we may
ignore the GLB overhead. The performance results are shown in Table 6.1. Task creation and
scheduling (i.e., task insertion and deletion) dominate the computation time because
other operations require short times, for the following reasons. The essential computation for
Fibonacci consists of only addition and assignment, which is insignificant when compared with
the overhead of task scheduling. In addition, we also observe little communication, i.e., 98
sends/receives (about 20 milliseconds in total) for each processor. Since computing $F(26)$
requires scheduling nodes 589252 times (including the times for rescheduling nodes which
resume from suspension), the average time for one task scheduling and task creation is about
91.4 microseconds (53.88 seconds / 589252).
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The  results  for  simple  speedup  also  show  that  the  mechanism  for  task  scheduling  and 
creation  can  be  parallelized  well.  For  example,  the  simple  speedup  for  8  processors  is  as  high 
as  7.86  and  the  simple  efficiency  is  98%. 
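
The count of 589252 scheduling events quoted above can be cross-checked with a short
calculation. The sketch below rests on assumptions that are not stated in the text: the
recursion bottoms out at F(0) and F(1), every node is scheduled once when it is created, and
every internal node is scheduled once more when it resumes from suspension. The function names
are illustrative only.

```c
#include <assert.h>

/* Number of nodes in the call tree of F(n) = F(n-1) + F(n-2),
 * assuming (not stated in the text) that F(0) and F(1) are leaves. */
long tree_nodes(int n)
{
    assert(n >= 0);
    long prev = 1, cur = 1;              /* tree sizes for n = 0 and n = 1 */
    for (int i = 2; i <= n; i++) {
        long next = cur + prev + 1;      /* two subtrees plus the root */
        prev = cur;
        cur = next;
    }
    return n == 0 ? prev : cur;
}

/* One scheduling event per node on creation, plus one more for each
 * internal node when it resumes. In a tree whose internal nodes all
 * have exactly two children, internal = (nodes - 1) / 2. */
long scheduling_events(int n)
{
    long nodes = tree_nodes(n);
    return nodes + (nodes - 1) / 2;
}

/* Average time per event in microseconds, given the total in seconds. */
double avg_microseconds(double total_seconds, long events)
{
    return total_seconds * 1e6 / (double)events;
}
```

Under these assumptions, scheduling_events(26) reproduces the figure used in the text, and
avg_microseconds(53.88, 589252) gives the quoted average of roughly 91.4 microseconds.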


6.3  Set  Covering 


Set covering is an important application of integer linear programming in the area of
mathematical optimization. Balas and Padberg [9] surveyed many real applications of set covering
in industry and listed over 10 types. Examples include airline crew scheduling [22], truck
delivery [10], facility location [83], switching circuit design [84], political districting [36], and
information retrieval [30].

The set covering problem is

$$\min \{\, cx \mid Ax \ge e,\ x_j = 0 \text{ or } 1,\ \forall j,\ 1 \le j \le n \,\}$$

where $A$ is an $m \times n$ matrix of zeros and ones, $c$ is a row vector of $n$ positive
(integer) weights, and $e$ is a column vector of $m$ ones. The above definition can be
interpreted as follows: each column $j$ of $A$ is associated with a set $M_j$ (with weight
$c_j$) that has the following properties: (1) $M_j$ is a subset of $M = \{1, \ldots, m\}$;
(2) each value $i$ is in $M_j$ if and only if $a_{ij} = 1$.

The set covering problem is to find a minimum-weight family of subsets $M_j$, $1 \le j \le n$,
which covers all elements of $M$. For example, for airline crew scheduling, $M$ corresponds to
the set of flight legs (nonstop flights from one city to another) to be covered, while each
subset $M_j$ corresponds to a possible tour (starting and ending at the same point via a
sequence of flight legs) for a crew. Let each tour be associated with a cost. The problem is
then to obtain a minimum-cost collection of these tours to cover all the flight legs in $M$.
According to the survey in [9], there are usually hundreds of possible tours and thousands of
flight legs to be covered. In our experiments, we chose problems with 200 possible tours and
1000 flight legs to be covered (the same size was commonly used in [8, 35, 41]).
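
To make the definition concrete, the following sketch checks whether a 0/1 selection vector x
is a cover (that is, whether Ax >= e holds) and evaluates the objective cx. It is only an
illustration of the formulation: the row-major matrix layout and the function names are
assumptions, and this is not the representation used by the set covering program in the
experiments.

```c
#include <stddef.h>

/* Returns 1 if the 0/1 vector x (length n) covers every row of the
 * m-by-n 0/1 matrix a (stored row-major), i.e. if A x >= e; else 0. */
int is_cover(const int *a, size_t m, size_t n, const int *x)
{
    for (size_t i = 0; i < m; i++) {
        int covered = 0;
        for (size_t j = 0; j < n && !covered; j++)
            covered = a[i * n + j] && x[j];
        if (!covered)
            return 0;                /* element i is in no selected set */
    }
    return 1;
}

/* Objective value c x: the total weight of the selected columns. */
long cover_cost(const int *c, size_t n, const int *x)
{
    long cost = 0;
    for (size_t j = 0; j < n; j++)
        if (x[j])
            cost += c[j];
    return cost;
}
```

In the airline example, each row would be a flight leg, each column a tour, and x[j] = 1 would
mean that tour j is flown.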

The set covering problem, as well as other mathematical optimization problems, usually
requires large-scale branch-and-bound (B&B) tree search in order to locate the optimal solution.
We have found that these searching processes are highly dynamic in the sense that it is difficult
to make useful a priori estimates of the number of nodes the search tree will explore and the
sizes of the nodes. This indeterminacy, plus the possibly large load fluctuations in the network,
makes the problem of solving set covering over networks extremely difficult. Since both D&C
and BFS can be used to implement B&B tree search, the set covering problem is a good
application example to test both the PBFS and PDC scheduling algorithms in our experiments.

The most advanced technique [8, 35, 41] for solving the set covering problem, as well as
other mathematical optimization problems [73], is based on the following two steps:


1.  Convert it to a linear programming (LP) problem and find the optimum LP solution, which
may be a floating-point number. Usually, an LP problem can be solved very efficiently.
For example, Karmarkar [55] proposed a polynomial time algorithm to solve the LP
problem. In addition, Sethi and Thompson also designed a specialized pivot and probe
algorithm, called PAPA [88], to solve the LP problem efficiently.

2.  Apply a B&B technique to search for the desired optimal integer solution starting from
the  optimal  LP  solution.  Since  the  LP  solution  is  usually  close  to  the  integer  solution, 
the  search  tree  will  be  much  smaller  than  that  without  using  this  technique  of  utilizing 
the  LP  solution. 


The sequential set covering program for our experiments was originally written by Harche
and Thompson [41]; it was optimized by Wu et al. [106]. This program takes the LP result
generated by the PAPA package [88] and then performs B&B tree search¹ interactively. This
program is described as follows:


1.  The user chooses an upper bound and the tree search strategy, e.g., incomplete/complete
tree search and depth-/best-first search. Incomplete tree search is allowed because in
real applications getting a decent solution quickly is sometimes better than getting the
optimal solution in an enormous length of time.

2.  The  program  uses  the  chosen  strategy  to  search  for  all  the  solutions  with  costs  less  than 
the  given  upper  bound.  The  program  will  report  the  minimum  cost  of  these  solutions,  if 
one  exists. 
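
The second step can be sketched as a complete depth-first search that keeps only solutions
costing less than a given upper bound. This is a deliberately simplified stand-in, not the
Harche and Thompson program: it branches on including or excluding each column in index order,
it bounds only on the accumulated cost, and it does not use the LP solution at all. The type
and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Problem instance: a is an m-by-n 0/1 matrix in row-major order,
 * c holds the n positive column weights. */
struct sc_instance {
    size_t m, n;
    const int *a;
    const int *c;
};

/* 1 if the columns currently selected in x cover every row. */
static int covers_all(const struct sc_instance *p, const int *x)
{
    for (size_t i = 0; i < p->m; i++) {
        int covered = 0;
        for (size_t j = 0; j < p->n && !covered; j++)
            covered = p->a[i * p->n + j] && x[j];
        if (!covered)
            return 0;
    }
    return 1;
}

/* Depth-first branch and bound over columns j, j+1, ..., n-1. */
static void dfs(const struct sc_instance *p, size_t j, int *x,
                long cost, long *best)
{
    if (cost >= *best)
        return;                      /* bound: cannot beat the incumbent */
    if (covers_all(p, x)) {
        *best = cost;                /* new incumbent */
        return;
    }
    if (j == p->n)
        return;                      /* no columns left to branch on */
    x[j] = 1;
    dfs(p, j + 1, x, cost + p->c[j], best);
    x[j] = 0;
    dfs(p, j + 1, x, cost, best);
}

/* Returns the minimum cover cost strictly below upper_bound, or -1 if
 * no such cover exists. x is caller-provided scratch space of n ints. */
long sc_search(const struct sc_instance *p, long upper_bound, int *x)
{
    assert(p != NULL && x != NULL);
    long best = upper_bound;
    for (size_t j = 0; j < p->n; j++)
        x[j] = 0;
    dfs(p, 0, x, 0, &best);
    return best < upper_bound ? best : -1;
}
```

When the upper bound is chosen so low that no solution qualifies, the incumbent never changes,
so the set of expanded nodes does not depend on the order in which they are visited.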

¹In the B&B tree search, only one column of data in the set covering matrix is transferred
between any two nodes of the B&B tree.

Regarding parallelization of B&B programs, researchers [63, 66] have reported the speedup
anomaly  phenomenon  in  which  the  speedup  is  irregular  (sometimes  superlinear  and  sometimes 
sublinear)  when  different  numbers  of  processors  are  used.  Since  the  order  of  expanding  nodes 
may  vary  when  a  different  number  of  processors  is  used,  the  search  may  need  to  expand 
significantly  more  nodes  or  fewer  nodes. 

Since  this  anomalous  phenomenon  would  make  our  performance  analysis  difficult,  we 
choose  cases  in  which  the  phenomenon  does  not  occur.  This  ensures  a  rigorous  standard  in  our 
testing. For our experiments, we choose a problem in which no solution will be found in the
search, so no nodes are pruned at runtime. The expanded tree is therefore always the same
regardless of the search order. Note that the tree shape is still very irregular.

Based on the above criterion, the experiments use four problems, called Problems 1, 2, 3,
and 4, of increasing sizes. In these four problems, the heights of the search tree are 30, 36, 66,
and 71, respectively; the numbers of tree nodes are 1135, 2107, 7097, and 23767, respectively.

In  our  experiments  on  the  set  covering  problem,  we  apply  both  PBFS  and  PDC  scheduling 
algorithms, described in Sections 6.3.1 and 6.3.2, respectively. In Section 6.3.3, we show the
experimental  results  obtained  by  installing  the  GLB  tree  on  CABs.  In  addition,  we  also  apply 
a  scheduling  algorithm,  called  parallel  hybrid  search  (PHS),  to  the  set  covering  problem  in 
Section  6.3.4.  This  shows  how  easily  we  can  change  to  different  task  scheduling  algorithms  to 
possibly  obtain  better  scheduling. 


6.3.1  PDC 

Table 6.2 lists the performance results for Problems 1 through 4 with the PDC scheduling
algorithm. The results for speedups are also depicted in Figure 6.4. These results show that the
speedup increases with the problem size. In particular, on eight processors, the largest problem
(Problem 4) has a very good speedup of 7.71. This is because, in general, the number of nodes
N in a tree grows exponentially with the tree height h. For example, for a complete binary
tree, when h increases by one, N roughly doubles. Therefore, on average, the larger a problem,
the more independent tasks are available, and so the better the efficiency.

Table 6.2: Measured performance results for the PDC scheduling algorithm (T: time in
seconds, S: speedup, and E: efficiency).

Figure 6.4: Speedups for Problems 1-4 with PDC.

Table 6.3: Averaged scheduling overhead in seconds when using the PDC scheduling algorithm.

In order to understand the performance results of Table 6.2, we will examine the overhead
and the idle time for Problems 1 and 2, which are representative for our analysis. There
are two main kinds of overhead: the overhead for scheduling each task locally (without any
communication) and the overhead for communication. Table 6.3 shows the average scheduling
overhead on each processor for Problems 1 and 2. The overhead for scheduling each task
locally is about 80 microseconds (= 0.09 seconds / 1135) from the table. Since the average
task computation time is about 15 milliseconds (which is 16.01 seconds / 1135 tasks), the
scheduling overhead is less than 1%. Since we always search the same complete tree, the
overhead for scheduling local tasks can be distributed over processors nearly evenly.
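
Written out, the arithmetic behind these figures for Problem 1 (1135 tasks) is as follows; the
only inputs are the 0.09 seconds and 16.01 seconds quoted above, and the results are rounded.

```latex
\[
\frac{0.09\ \text{s}}{1135} \approx 79\ \mu\text{s},
\qquad
\frac{16.01\ \text{s}}{1135} \approx 14.1\ \text{ms},
\qquad
\frac{0.09\ \text{s}}{16.01\ \text{s}} \approx 0.56\% < 1\%.
\]
```

The first two values correspond to the rounded figures of about 80 microseconds and about 15
milliseconds used in the text.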


           height   p = 1   p = 2   p = 4   p = 8
Problem 1      30                     405    1154
Problem 2      36       8     265     502    1366

Table 6.4: Number of send/receive pairs for the PDC scheduling algorithm.

To  analyze  the  communication  overhead,  we  measured  the  number  of  sends/receives  for 
Problems  1  and  2;  results  are  shown  in  Table  6.4.  Since  each  send/receive  is  roughly  200 
microseconds  [27],  we  can  roughly  calculate  the  aggregate  communication  overhead  from  the 
table.  For  PDC,  the  ratio  of  the  number  of  sends/receives  to  the  product  of  the  tree  height  and 
the  number  of  processors  is  close  to  a  constant  between  3  and  5.  This  result  is  close  to  the 
theoretical communication cost [105], which is O(ph).
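
As an illustration of this ratio, the sketch below divides a measured number of send/receive
pairs by the product of the number of processors and the tree height, and also converts a
count into an estimated communication time at a fixed cost per send/receive. The function
names are hypothetical, and treating the table entries as directly comparable counts is an
assumption made for the example.

```c
#include <assert.h>

/* Ratio of send/receive pairs to p * h, the quantity the text reports
 * as roughly constant for PDC. */
double comm_ratio(long pairs, int processors, int height)
{
    assert(processors > 0 && height > 0);
    return (double)pairs / ((double)processors * (double)height);
}

/* Estimated communication time in seconds for a given count, assuming
 * a fixed cost per send/receive (about 200 microseconds in the text). */
double comm_seconds(long pairs, double usec_per_pair)
{
    return (double)pairs * usec_per_pair * 1e-6;
}
```

For the Problem 2 row of Table 6.4 (height 36), comm_ratio(265, 2, 36), comm_ratio(502, 4, 36),
and comm_ratio(1366, 8, 36) all fall between 3 and 5, in line with the statement above.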

Recall the advanced global protocol in Section 4.1.2.3, which requires a second broadcast
during each load balancing round if the total load in T', is small. In our experiments, we only
need the second broadcast at most once for each run. This result is close to the theory in Section
4.1.2.3.

Processors  become  idle  mainly  in  situations  which  lack  parallelism  or  have  long  latency 
for  responding  to  task  requests.  In  our  implementation,  we  use  a  scheme  in  which  a  processor 
schedules  the  next  task  while  executing  the  current  task.  Since  the  task  granularity  is  quite 
large  when  compared  with  the  latency  of  task  request  (each  task  request  requiring  about  3  to  5 
sends/receives),  processors  become  idle  mainly  due  to  lack  of  parallelism.  Note  that  if  the  task 
request  latency  is  very  large  (e.g.,  when  using  a  wide-area  network),  our  system  can  schedule 
several  tasks  at  a  time  so  that  the  latency  can  be  hidden. 

Table 6.5: Average idle times (in seconds) when using the PDC scheduling algorithm.

Table 6.5 shows the idle times for Problems 1 and 2. In general, the more processors that are
used, the more likely they are to become idle. For the trees in the set covering problem of our
experiments, the average branching factor (defined as $N^{1/h}$) is very low, about 1.1 to 1.3.
So, for example, when we initially execute the part of the tree near the root, not much
parallelism can be exploited. When the number of processors increases, we expect the maximum
value of the average idle time to be close to the execution time of all the nodes on the
critical path from the root to the leaves. Since the number of nodes on the critical path
(about O(h)) is small compared with the total number of nodes (N), the idle time for the PDC
scheduling algorithm can be negligibly small.


6.3.2 PBFS

Table 6.6: Measured performance results for the PBFS scheduling algorithm (T: time in
seconds, S: speedup, and E: efficiency).

Table 6.6 lists the measured performance for Problems 1 through 4 for the PBFS scheduling
algorithm. The results for speedups are also depicted in Figure 6.5. Note that the PBFS results
are not as good as the PDC results. This is mainly because, as shown in Section 5.1, the PDC
scheduling algorithm employs a provably minimum number of interprocessor communication messages.

Table 6.7: Measured number of sends/receives when using the PBFS scheduling algorithm.

For  PBFS,  the  idle  time  and  the  scheduling  overhead  are  very  similar  to  those  for  PDC 
and  their  presence  is  for  the  same  reasons  discussed  before.  So,  here  we  will  only  discuss  the 
communication  overhead  for  PBFS.  Table  6.7  shows  the  measured  numbers  of  send/receive 
pairs  for  Problems  1  and  2.  The  numbers  for  PBFS  are  much  larger  than  those  for  PDC.  This  is 
because  in  PBFS  whenever  a  task  is  scheduled,  the  minimum-cost  task  needs  to  be  recomputed 
across all processors. As mentioned above, the communication cost for PBFS is O(N). If we
check  each  column,  the  ratios  of  the  amount  of  communication  to  the  number  of  nodes  are 
nearly  constant.  But,  when  we  use  more  processors,  the  ratios  go  up.  This  is  because  in  a 
realistic  situation  we  need  more  control  packets  to  help  send  a  cross  node  out  when  using  more 
processors.  However,  our  system  only  lets  the  ratio  grow  from  0.7  (for  one  processor)  to  1.7 
(for 8 processors).

Figure 6.6: Efficiencies with PDC (installing GLB on CABs)


Figure 6.7: Efficiencies with PBFS (installing GLB on CABs)


6.3.3 Global Load Balancer on CABs


In  this  section,  we  will  investigate  the  effect  of  installing  the  global  load  balancer  (GLB)  tree 
on  CABs.  When  the  GLB  is  installed  on  CABs,  the  GLB  work  is  “off-loaded”  from  hosts  to 
CABs  and  therefore  applications  should  get  better  performance. 

Figure  6.6  (6.7)  shows  efficiencies  of  the  PDC  (PBFS)  scheduling  algorithms  for  Problems 
1  and  2.  The  dashed  lines  are  results  obtained  when  we  install  the  GLB  tree  on  CABs;  the 
solid  lines  are  the  same  results  as  those  in  Sections  6.3.1  and  6.3.2,  which  are  obtained  when 
we  install  the  GLB  tree  on  hosts. 

The  performance  data  indicate  that  we  can  improve  the  efficiency  by  about  2%  to  20%  for 
PDC  and  5%  to  20%  for  PBFS  by  installing  the  GLB  on  CABs.  The  improvement  is  small 
because the task grain size (about 15 milliseconds) is large when compared with the time for each
send/receive  (about  200  microseconds).  In  general,  installing  GLB  on  CABs  will  help  more 
for  those  applications  requiring  heavy  communication,  such  as  problems  with  small-grained 
tasks.  Our  results  also  show  that  we  can  save  more  time  for  PBFS  (which  requires  more 
communication)  than  for  PDC. 


Figure 6.8: Performance results for parallel hybrid search.


6.3.4 Parallel Hybrid Search


In the set covering problem, the computation tree is usually lopsided. A node with a smaller
cost usually has a bigger subtree. So, we can change PDC by assigning to the node global
priority = -c, where c is the cost of the node. This is called parallel hybrid search (PHS)
in the sense that it behaves like PDC locally (on each processor) while it behaves like PBFS
globally (over the whole system).
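
A minimal sketch of this priority assignment is given below. It assumes that the priority being
changed is the global one (which is what makes the scheme behave like PBFS across processors),
that the local priority stays equal to the node level as in PDC, and that a numerically larger
value means a higher priority. The struct and its field names are hypothetical and are not
part of the system's interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record; the field names are illustrative only. */
struct task {
    int  level;       /* depth of the node in the search tree */
    long cost;        /* cost c of the B&B node */
    long local_pri;   /* priority used on the owning processor */
    long global_pri;  /* priority used when balancing across processors */
};

/* PHS as sketched here: keep the PDC local priority (the level) and
 * use -cost globally, so that cheaper nodes rank higher system-wide. */
void phs_assign(struct task *t)
{
    assert(t != NULL);
    t->local_pri = t->level;
    t->global_pri = -t->cost;
}
```

With this assignment, a node of cost 17 receives a higher global priority than a node of cost
42 at the same level, so it is the one preferred when load is balanced across processors.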

The  performance  results  for  Problems  1  and  2  with  PHS  are  shown  in  Figure  6.8.  We 
observe  that  PHS  is  slightly  better  than  PDC,  by  at  most  a  10%  margin.  Most  importantly, 
this  experiment  demonstrates  how  easily  we  can  change  our  scheduling  algorithm  to  obtain 
possibly  better  results. 


6.4  Summary 


We  summarize  our  experimental  results  as  follows: 


•  The average overhead for scheduling/creating one local task can be as low as 80
microseconds. The scheduling overhead can be parallelized well too.

•  For  the  set  covering  problem,  we  generally  get  good  speedups  for  both  PBFS  and  PDC. 
For  a  specific  large  problem  requiring  less  than  500  seconds,  we  observed  a  speedup  of 
7.71  with  PDC  and  7.33  with  PBFS  on  8  processors. 

•  The number of sends/receives for PDC is about O(ph) while that for PBFS is about
O(N). So, PDC can be scaled up well. For PBFS, a high-speed network can help
improve performance significantly.

•  For  PDC  and  PBFS,  the  maximum  value  of  the  average  idle  time  can  be  close  to  the 
execution  time  of  all  the  nodes  on  the  critical  path  from  the  root  to  leaves.  Since  the 
number  of  the  nodes  on  the  critical  path  (h)  is  small  compared  with  the  total  number  of 
nodes  (N),  the  idle  time  for  the  two  scheduling  algorithms  can  be  negligibly  small. 

•  We can “off-load” the computation of the GLB to network coprocessors, if they exist.

•  Although parallel hybrid search does not improve performance significantly, it serves to
demonstrate  how  easily  we  can  change  our  scheduling  algorithm  to  obtain  possibly  better 
results.  This  point  is  especially  important  for  parallel  programmers  who  come  up  with 
many  possible  scheduling  algorithms  and  want  to  compare  them  empirically. 




Chapter  7 


Conclusions 


7.1  Summary 


In  this  thesis,  we  have  proposed  a  general  parallel  programming  model,  called  the  multilist 
scheduling  model,  which  decomposes  task  scheduling  into  (1)  the  specification  of  scheduling 
policies  and  (2)  the  implementation  of  supportive  scheduling  operations  (e.g.,  routines  for 
maintaining task lists and handling interprocessor communication for load balancing), and then
hides  the  latter  details  from  the  programmer.  This  model  is  based  on  a  uniform  scheduling 
model  involving  the  use  of  multiple  scheduling  lists.  The  system  has  the  following  advantages: 


1.  Ease of use. Under this new model, programmers only need to specify scheduling policies
based  on  scheduling  lists  in  order  to  implement  scheduling  algorithms;  they  do  not  need 
to  write  the  details  of  supportive  scheduling  routines.  In  fact,  the  supportive  scheduling 
routines  are  the  most  difficult  and  time-consuming  part  to  write.  Typically  they  require 
thousands of lines of code in C. This was the case in our earlier experience [31, 62] in
parallelizing Noodles, a solid modeling package, where it took 3 months to write the load
balancing part! In sharp contrast, the code for the PDC-WK and PBFS-GPQ scheduling
algorithms (shown in Appendix A.2) only has about 10-20 lines. A program of this size
can be written within tens of minutes.

2.  Generality. We have shown that our new model results in no loss of generality with
respect  to  the  standard  scheduling  model  (see  Chapter  2).  That  is,  we  can  recast  any 
scheduling  algorithm  in  terms  of  our  multilist  scheduling  model.  We  have  illustrated 
the  generality  of  the  model  by  implementing  nine  scheduling  algorithms  (in  Chapter  3) 
based  on  the  model  —  two  for  parallel  divide-and-conquer  and  two  for  parallel  best-first 
search,  two  for  parallel  network  simulation,  one  for  parallel  quicksort,  one  for  parallel 
loops,  and  one  for  parallel  alpha-beta  search. 


3.  Efficiency. 


•  We have shown that our general approach incurs no significant performance overhead,
at least for the scheduling algorithms for parallel BFS and D&C. Although
the ultimate goal would be to show that our general approach incurs no significant
overhead for any scheduling algorithm, this goal appears to be impossible to meet.
Our limited success, however, is still significant. In the past, it was unclear how
to support scheduling algorithms for both parallel BFS and D&C in a uniform
framework [34]. We believe our system is the first that can do so.

•  We  have  devised  an  efficient  technique  to  cope  with  the  main  problem  of  task 
prioritization  which  arises  when  there  is  a  sparse  distribution  of  priorities.  When  few 
tasks  have  the  same  priority,  it  would  be  inefficient  to  simply  perform  load  balancing 
by  considering  only  those  tasks  that  have  the  highest  priority.  Our  approach  is  to  use 
a  novel  algorithm,  called  the  parallel  range  selection  (PRS)  algorithm  (see  Section 
5.2),  to  try  to  select  additional  highest-priority  tasks,  such  that  the  total  computation 
time  of  these  tasks  is  comparable  to  the  maximum  overhead  for  load  balancing. 
Then,  we  schedule  tasks  from  this  set  in  order  to  balance  the  load.  In  Section  5.2, 
we  proved  that  the  PRS  algorithm  only  requires  one  combining  and  disseminating 
operation (defined in Section 4.1.2.2), and each packet size in the algorithm is only
O(log'p), where p is the number of processors. These results show that the PRS
algorithm  performs  very  efficiently  on  network-based  multicomputers. 

•  We have obtained good experimental results. For example, for a specific set covering
problem requiring less than 500 seconds, we obtain a speedup of 7.71 for parallel
D&C and 7.33 for parallel BFS on Nectar with 8 processors.


7.2 Contributions


The  main  contribution  of  this  thesis  is  a  new  approach  for  parallel  task  scheduling,  multilist 
scheduling,  which  is  easy  to  use,  general,  and  efficient.  This  new  approach  successfully 
decomposes  the  task  scheduling  process  into  the  specification  of  scheduling  policies  and  the 
supportive  scheduling  routines  such  that  the  scheduling  programmers  can  focus  on  designing 
efficient  scheduling  algorithms,  while  the  system  designers  can  focus  on  designing  efficient 
supportive  routines. 

While  proposing  this  new  model,  we  also  make  the  following  contributions: 


•  We  develop  some  multilist  scheduling  schemes  to  implement  nine  scheduling  algorithms 
—  two  for  parallel  divide-and-conquer  and  two  for  parallel  best-first  search,  two  for 
parallel  network  simulation,  one  for  parallel  quicksort,  one  for  parallel  loops,  and  one  for 
parallel  alpha-beta  search. 

•  We  show  that  the  multilist  scheduling  model  results  in  no  loss  of  generality  with  respect 
to  the  standard  scheduling  model  (see  Chapter  2). 

•  We  present  an  efficient  scheduling  algorithm  for  parallel  D&C  and  prove  that,  among  all 
the  scheduling  algorithms  which  can  split  the  load  nearly  evenly,  our  algorithm  is  optimal 
with  respect  to  the  communication  cost. 

•  We  design  an  efficient  PRS  algorithm  for  the  situation  of  sparse  priority  distribution  and 
prove  that  the  algorithm  only  needs  one  combining  and  then  one  disseminating  operation 
and each packet size is only O(log'p), where p is the number of processors.

•  We design an efficient data structure for the operations Insert, Delete, MaxPri,
DeleteMax, ThreshPri, and Split (see Section 4.2). We also prove that the computation
times for all the above operations are O(log n), where n is the number of distinct
priorities in the priority queues, and the amortized times for the above operations which
access/insert/delete a task with the highest or lowest priority are O(1).

•  We demonstrate good performance results for the scheduling algorithms for parallel BFS
and D&C.

Multilist scheduling is the first approach which can hide the details of supportive scheduling
routines  while  simultaneously  supporting  general  task  scheduling.  We  expect  that  this  thesis 
will  have  a  significant  impact  on  future  parallel  programming,  especially  in  the  domain  of 
multicomputers. 

7.3 Future Work

The  above  research  can  be  extended  in  the  following  directions. 


•  Apply  the  model  to  more  applications,  such  as  operations  research  problems  (e.g.,  the 
traveling salesman problem [76]), alpha-beta search problems, network simulation problems,
scientific problems, and any other interesting problems.

•  Formalize  a  language/interface.  This  also  includes  two  important  parallel  programming 
issues:  handling  global  variables  [14,  15]  and  calling  remote  procedures  [7,  13,  54,  93, 
96,  102]  over  distributed-memory  systems. 

•  Add  fault  tolerance  to  the  system.  It  is  especially  important  for  a  network  to  tolerate 
processor  failure.  For  example,  a  remote  workstation  may  be  turned  off  unexpectedly. 

•  Develop tools on top of this package for debugging, performance monitoring [19],
and graphical development [12].

•  Port  our  model  to  other  parallel  systems,  e.g.,  CM5  [99]  and  iWarp  [18];  also  port  the 
programming system to other network interfaces, e.g., PVM and sockets of TCP/IP.


Appendix A


User  Interface 


The  current  multilist  scheduling  system  is  operational  at  CMU.  It  has  a  C  language  interface. 
This interface is presented in Section A.1 in enough detail to illuminate the code of the
PBFS-GPQ and PDC-WK scheduling algorithms, which will be presented in Section A.2.


A.1 Interface Definitions


To  distinguish  our  interface  from  system  calls  or  variables,  we  prefix  each  function  name  with 
“MLS_”  (MultiList  Scheduling). 


A.1.1 Initializing the Multilist Scheduling System

void MLS_Init (spec, pkey_type, min_grain);

The MLS_Init() procedure initializes the multilist scheduling system.

The parameter spec is the name of a user-defined routine which specifies the scheduling
pattern, i.e., the declaration of PLs and the merge patterns for VLs (all of which will be defined
later). In the current implementation, the routine can only be executed inside MLS_Init().




The  parameter  min_grain  is  an  integer,  representing  the  minimum  task  grain  size  used 
in  the  computation.  The  system  currently  assumes  that  all  tasks  have  the  same  grain  size. 
This  parameter  provides  the  task  grain  size  so  that  the  system  can  easily  set  up  some  system 
parameters  for  load  balancing.  In  our  current  system,  since  the  overhead  for  task  scheduling 
and  creation  is  already  80  microseconds  (as  shown  in  Section  6.3)  on  a  Sun4/330,  we  let  one 
unit  of  grain  size  stand  for  100  microseconds  on  Sun4/330s. 

The  parameter  pkey_type  determines  the  type  of  priority  keys  used  in  the  system.  The 
choices  are  integer,  string,  bit  string,  etc.  But,  since  the  integer  type  is  most  common,  we  only 
support  the  integer  type  (MLS_INT_KEY)  in  the  current  implementation.  The  work  in  [85] 
uses  bit  string  as  the  key  type. 

Since  our  system  is  in  the  SPMD  (single  program,  multiple  data)  style  [56],  the  program 
is  replicated  on  each  processor.  Each  instance  of  the  program  (on  a  distinct  processor)  must 
explicitly call MLS_Init() before calling any other functions which are described below for
the  multilist  scheduling  system. 
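
The call sequence described above can be sketched as follows. So that the fragment compiles on
its own, it supplies stand-in definitions for MLS_Init and MLS_INT_KEY; these stand-ins, the
parameter types, the helper names, and the rule of rounding the grain size down with a minimum
of one unit are all assumptions, since the text gives only the parameter names and the
convention that one grain unit stands for 100 microseconds.

```c
#include <assert.h>
#include <stddef.h>

/* ---- Stand-ins for the library; the types below are assumed. ---- */
enum { MLS_INT_KEY = 0 };
typedef void (*mls_spec_fn)(void);

static int mls_initialized = 0;
static int mls_min_grain = 0;

void MLS_Init(mls_spec_fn spec, int pkey_type, int min_grain)
{
    assert(spec != NULL);
    (void)pkey_type;             /* only integer keys are supported */
    mls_min_grain = min_grain;
    spec();                      /* the spec routine runs inside MLS_Init() */
    mls_initialized = 1;
}

/* ---- User code (hypothetical) ---- */

/* Convert a task time in microseconds to grain units, where one unit
 * stands for 100 microseconds; rounds down, with a minimum of 1. */
int grain_units(long task_usec)
{
    long units = task_usec / 100;
    return units < 1 ? 1 : (int)units;
}

static int spec_calls = 0;

static void my_spec(void)
{
    spec_calls++;                /* PLs and merge patterns would go here */
}

/* Each SPMD instance would make this call before any other MLS_ call. */
void start_scheduler(long task_usec)
{
    MLS_Init(my_spec, MLS_INT_KEY, grain_units(task_usec));
}
```

For the set covering experiments, where a task takes about 15 milliseconds, this convention
would give a min_grain of 150 units.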


A.1.2  Physical  Lists 


There  are  two  kinds  of  physical  lists  (PLs):  base  PLs  and  derived  PLs. 


MLS_List_p MLS_Base_PL (list_name, list_type,
                        lb, ub, Indiv_Range);

The MLS_Base_PL() procedure creates a base PL with the name list_name, the type
list_type, the expected priority range between lb and ub (inclusive), and the indivisible
function Indiv_Range, and then returns a pointer to the PL with type MLS_List_p. Note
that the word list in our interface, by convention, represents PL.

The  parameters  lb  and  ub  respectively  specify  the  lower  bound  and  upper  bound  of  the 
priority range in this PL. If the bounds are not known a priori, the user can use the complete
priority range, whose lower bound and upper bound are respectively MLS_MIN_PRI and
MLS_MAX_PRI, defined by the system.

The parameter Indiv_Range is some user-defined function IR(pri): given pri, the
function IR() returns a priority range containing the priority pri. The PL can use this function
to determine if a new maximum priority is out of the original priority range. If so, the system
may need to report to another processor. (See Section 4.1.1.) If the parameter Indiv_Range
is null, this implies that each priority is a distinct indivisible priority range. If the PL will not
be merged into any VL on another processor, the parameter Indiv_Range has no effect.
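
As an example of what a user-defined IR(pri) might look like, the sketch below groups integer
priorities into fixed-width bands and returns the band containing a given priority. The range
type, the band width, and both function names are hypothetical: the text does not say how a
range is represented or returned, so this shows one possible shape rather than the interface
itself.

```c
/* Hypothetical range type with inclusive bounds. */
struct pri_range {
    int lo;
    int hi;
};

#define BAND_WIDTH 10

/* Example IR(pri): priorities are grouped into bands
 * [k * BAND_WIDTH, k * BAND_WIDTH + BAND_WIDTH - 1]. */
struct pri_range band_range(int pri)
{
    /* Floor division, so that negative priorities land in the band
     * that actually contains them. */
    int k = pri >= 0 ? pri / BAND_WIDTH
                     : -((-pri + BAND_WIDTH - 1) / BAND_WIDTH);
    struct pri_range r = { k * BAND_WIDTH, k * BAND_WIDTH + BAND_WIDTH - 1 };
    return r;
}

/* The kind of check described in the text: does a new maximum priority
 * fall outside the range that contained the old maximum? */
int leaves_range(int old_max, int new_max)
{
    struct pri_range r = band_range(old_max);
    return new_max < r.lo || new_max > r.hi;
}
```

With bands of width 10, a maximum priority moving from 37 to 39 stays within the same
indivisible range, whereas a move from 37 to 40 leaves it.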

We currently restrict the list type to 2-4 trees (see Section 4.2.1) to implement PLs. But, in
the  future,  the  parameter  list_type  may  be  used  to  specify  another  type  of  list,  in  case  we 
discover  some  type  of  list  which  is  more  efficient  for  some  cases. 

If this procedure is called for the jth time on processor $P_i$, the created PL is designated
$PL_{i,j}$. In the current version, we simplify operations by assuming that all jth lists
$PL_{\cdot,j}$ have the same list name and have the same parameters. So, we do not need
communication to set up associated links (in the scheduling pattern).


MLS_List_p  MLS_Derived_PL  {list_naine,  base,  func,  inv 

lb,  ub,  Indiv_Range) ; 

The MLS_Derived_PL() procedure creates a derived PL which is based on a base
PL base with the priority translation function func, whose inverse function is inv. The
parameters list_name, lb, ub, and Indiv_Range are the same as those for base PLs. If
this PL is not part of a global scheduling subpattern, the inverse function inv will not be used.
Note that this operation for derived PLs has not yet been implemented.
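
Since the derived-PL operation is not implemented, the following is only a sketch of what a func/inv pair could look like. The affine translation and the names to_derived and to_base are hypothetical; the only property taken from the text is that inv undoes func.

```c
#include <assert.h>

/* Hypothetical translation: derived priority = 2 * base priority + 1. */
int to_derived(int base_pri) {
    return 2 * base_pri + 1;
}

/* Inverse translation: every derived priority is odd, so subtracting 1
   and halving recovers the base priority exactly. */
int to_base(int derived_pri) {
    assert((derived_pri - 1) % 2 == 0);
    return (derived_pri - 1) / 2;
}
```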


A.1.3  Merging  Physical  Lists  into  Virtual  Lists 

In  the  current  implementation,  we  provide  the  following  functions  to  specify  how  to  merge  PLs 
into  VLs: 


void MLS_Merge (list_name, procx);
void MLS_Merge_Local (list_name);

Assume that the procedure MLS_Merge() is executed on a processor P_i. Then, this
procedure specifies that a PL named list_name on processor P_j will be merged into VL_i according
to the standard protocol, where j = procx. If j = i, we can use MLS_Merge_Local
instead. In any case, a system variable, called MLS_this_proc (= i), is provided to refer to
the current processor P_i.

void MLS_Merge_All (list_name);

Assume that the procedure MLS_Merge_All() is executed on a processor P_i. Then,
this procedure specifies that all the PLs named list_name (over the whole system) will be
merged into the VL on processor P_i. If every processor calls the same procedure, the system
can apply the advanced global protocol to these PLs.

void MLS_Merge_Dynamic (list_name, pset);

Assume that the procedure MLS_Merge_Dynamic() is executed on a processor P_i. This
procedure specifies that all the PLs named list_name in a set S of processors will be merged
into the VL on processor P_i, but the processor set S can be dynamically determined as follows:
whenever processor P_i tries to schedule a task and needs to check the PLs named list_name,
processor P_i will execute the function pset() again to obtain the new set S, and then find the
highest priority among the PLs on the processors in S.
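
The lookup described above can be modelled in a few lines. This is a sketch under stated assumptions, not the MLS implementation: the callback type pset_fn, the array top_pri (standing in for the highest priority held in each processor's PL), the NO_TASK marker, and the helper even_procs are all hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define NPROC 4
#define NO_TASK INT_MIN  /* assumed marker meaning "no task found" */

/* Hypothetical callback type: writes processor indices into out[]
   (at most cap of them) and returns how many were written. */
typedef int (*pset_fn)(int out[], int cap);

/* top_pri[k] models the highest priority currently in the PL on
   processor k. The set S is recomputed on every call, mirroring the
   re-execution of pset() on each scheduling attempt. */
int highest_in_set(const int top_pri[NPROC], pset_fn pset) {
    int members[NPROC];
    int n = pset(members, NPROC);
    int best = NO_TASK;
    for (int i = 0; i < n; i++) {
        int k = members[i];
        assert(k >= 0 && k < NPROC);
        if (top_pri[k] > best) best = top_pri[k];
    }
    return best;
}

/* Example set function: processors 0 and 2. */
int even_procs(int out[], int cap) {
    assert(cap >= 2);
    out[0] = 0;
    out[1] = 2;
    return 2;
}
```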

Let us consider the PDC-RR scheduling algorithm as an example. The programmer can call
this function as follows.

MLS_Merge_Dynamic ("GL", rr_pset);

The function rr_pset(), provided by the programmer, returns the next processor in a round-robin
fashion. More specifically, when called, this function sets the variable s = (s mod p) + 1
and returns s so that the system knows that the processor set includes processor P_s. The
programmer also needs to specify the priority ranges of GL and LL so that the system
knows that every priority in LL is no less than any priority in the GLs. Consequently, the processor calls the function and
schedules a task from a GL only when there are no local tasks. Thus, the system will perform
as described in [34, 44, 82].
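
A minimal sketch of the round-robin update follows. The name rr_next, passing p as an argument, and the initial value of s are assumptions; the text specifies only the update rule s = (s mod p) + 1 with processors numbered 1 to p, and does not show the real signature expected of rr_pset().

```c
#include <assert.h>

/* Assumed initial value of s; the source does not state it. */
static int rr_s = 1;

/* Advance s to the next processor number in 1..p and return it,
   following s = (s mod p) + 1. */
int rr_next(int p) {
    assert(p > 0);
    rr_s = (rr_s % p) + 1;
    return rr_s;
}
```

Because the result of s mod p lies in 0..p-1, adding 1 keeps s within 1..p and wraps from p back to 1.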

A.1.4 Priority Assignment


void  MLS_Task_Pri  (PL,  T,  pri) ; 
void  MLS_Change_Pri  (PL,  T,  pri) ; 

The procedure MLS_Task_Pri(), assumed to be called on processor P_i, inserts task
T into the PL PL on processor P_i according to the priority pri. But, if the value of pri
is MLS_UNDEF_PRI, the system will not insert the task into the PL. Note that the current
implementation does not support the feature of creating a task on another processor.

Similarly, the MLS_Change_Pri() procedure, assumed to be called on processor P_i,
changes the priority of task T to pri.
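
The insertion rule can be illustrated with a toy list. This is a sketch only: the real PLs are 2-4 trees, whereas the sorted array toy_pl, the function toy_task_pri, the capacity, and the value chosen for TOY_UNDEF_PRI are hypothetical stand-ins.

```c
#include <assert.h>
#include <limits.h>

#define TOY_UNDEF_PRI INT_MIN  /* stand-in for MLS_UNDEF_PRI; real value unknown */
#define TOY_CAP 16

/* Toy PL: tasks kept in descending priority order. */
typedef struct {
    int task[TOY_CAP];
    int pri[TOY_CAP];
    int n;
} toy_pl;

/* Insert a task by priority. Returns 1 if inserted, 0 if skipped
   (undefined priority, or the toy list is full). */
int toy_task_pri(toy_pl *pl, int task, int pri) {
    assert(pl != 0);
    if (pri == TOY_UNDEF_PRI || pl->n == TOY_CAP) return 0;
    int i = pl->n++;
    /* Shift lower-priority entries right to make room. */
    while (i > 0 && pl->pri[i - 1] < pri) {
        pl->task[i] = pl->task[i - 1];
        pl->pri[i] = pl->pri[i - 1];
        i--;
    }
    pl->task[i] = task;
    pl->pri[i] = pri;
    return 1;
}
```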


A.2  Examples 


Using  the  above  interface  definitions,  we  will  illustrate  the  code  for  the  PDC-WK  and  PBFS- 
GPQ  scheduling  algorithms,  below. 

/*** following are in the file PBFS.h ***/
extern void PBFS_Schd();  /* scheduler for PBFS */

MLS_List_p LT;  /* scheduling list */

/* interface for the programmer writing BFS application code. */

#define DECLARE_NODE(T, c)  MLS_Task_Pri(LT, T, -c)

#define INIT_PBFS(g)  MLS_Init(PBFS_Schd, g, INTKEY)

/*** following are in the file PBFS.c ***/
void PBFS_Schd() {

    /* create a PL */

    LT = MLS_Base_PL("LT", MLS_INT_KEY, MLS_MIN_PRI, MLS_MAX_PRI, NULL);

    /* merge all PLs (named "LT") in the entire system */

    MLS_Merge_All("LT");

}


Figure A.1: The code for the PBFS-GPQ scheduling algorithm.

/*** following are in the file PDC.h ***/
extern void PDC_Schd();  /* scheduler for PDC */

MLS_List_p LL, GL;  /* local and global scheduling lists */

/* interface for the programmer writing D&C application code. */
/* argument order follows the prototype MLS_Task_Pri(PL, T, pri) */
#define DECLARE_NODE(T, l) \
    { MLS_Task_Pri(LL, T, l); MLS_Task_Pri(GL, T, -l); }

#define INIT_PDC()  MLS_Init(PDC_Schd, DC_GRAIN, INTKEY)

/*** following are in the file PDC.c ***/
void PDC_Schd() {

    /* create two PLs */

    LL = MLS_Base_PL("LL", MLS_INT_KEY, 0, MLS_MAX_PRI, NULL);

    GL = MLS_Base_PL("GL", MLS_INT_KEY, MLS_MIN_PRI, 0, NULL);

    /* merge the local LL and all GLs */

    MLS_Merge_Local("LL");
    MLS_Merge_All("GL");

}


Figure A.2: The code for the PDC-WK scheduling algorithm.
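
The effect of the two priority assignments in Figure A.2 can be checked with a small model. This is a sketch, assuming that the second argument of DECLARE_NODE is a non-negative level number; the type pdc_pris and the helper names are hypothetical and are not part of the MLS interface.

```c
#include <assert.h>

/* Priorities that DECLARE_NODE assigns to one task. */
typedef struct { int ll_pri; int gl_pri; } pdc_pris;

/* Mirror of DECLARE_NODE(T, l): level l enters LL with priority l
   and GL with priority -l. */
pdc_pris pdc_priorities(int level) {
    assert(level >= 0);
    pdc_pris p = { level, -level };
    return p;
}

/* 1 if a task at level a is taken before a task at level b from LL. */
int local_prefers(int a, int b) {
    return pdc_priorities(a).ll_pri > pdc_priorities(b).ll_pri;
}

/* 1 if a task at level a is taken before a task at level b from GL. */
int global_prefers(int a, int b) {
    return pdc_priorities(a).gl_pri > pdc_priorities(b).gl_pri;
}
```

Under this assumption, the local list favours higher level numbers and the global list favours lower ones, and every LL priority (at least 0) is no less than every GL priority (at most 0), which matches the ranges passed to MLS_Base_PL.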